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suppressing  noise  which  has  been  acoustically  added  to 


To  demonstrate  that  through  the  incorporation  of  the 
noise  suppression  methods,  speech  can  be  effectively 
analysed  for  narrow  band  digital  transmission  in  practical 


operating  environments 
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A SPECTRAL  SUBTRACTION  ALGORITHM  FOR 
SUPPRESSION  OF  ACOUSTIC  NOISE  IN  SPEECH 


I « 


Steven  F.  Boll 


Abstract 


Spectral  subtraction  has  been  shown  to  be  an  effective 

t 

approach  for  reducing  ambient  acoustic  noise  in  order  to 
improve  the  intelligibility  and  quality  of  digitally 
compressed  speech.  This  paper  presents  a set  of 
implementation  specifications  to  improve  algorithm 
performance  and  minimize  algorithm  computation  and  memory 
requirements.  It  is  shown  spectral  subtraction  can  be 
implemented  in  terms  of  a nonstationary r multiplicative, 
frequency  domain  filter  which  changes  with  the  time  varying 
spectral  characteristics  of  the  speech.  Using  this  filter  a 
speech  activity  detector  is  defined  and  used  to  allow  the 
algorithm  to  adapt  automatically  to  changing  ambient  noise 
environments.  Also  the  bandwidth  information  of  this  filter 
is  used  to  further  reduce  the  residual  narrowband  noise 
components  which  remain  after  spectral  subtraction. 


RANK-ORDER  SPEECH  CLASSIFICATION  ALGORITHM 


(RASCAL) 


Ben  Cox 

L,  K.  Timothy 
Abstract 

This  paper  describes  a theoretical  and  experimental 
investigation  for  detecting  the  presence  of  speech  in  wide 
band  noise.  A robust  algorithm  for  making  the 
silence-voiced-unvoiced  decision  is  described.  This 
algorithm  is  based  on  a nonparametr  ic  rank-order  statisstical 
signal-detection  scheme  that  does  not  require  a training  set 
of  data  and  maintains  a constant  false-alarm  rate  for  a 
broad  class  of  noise  inputs  corresponding  to  a single 
decision  threshold.  The  nonparametr ic  rank-order  decision 
procedure  is  the  multiple  use  of  the  two-sample  Savage  T 
statistic.  The  performance  of  this  detector  is  evaluated 
and  compared  to  that  obtained  by  manually  classifying  twenty 
recorded  utterances  with  39,  30,  20,  10,  and  0 decibel 
signal-to-noisv.  ratios.  In  limited  testing,  the  average 


probability  of  misclassif ication  is  less  that  5 percent,  12 
percent,  and  55  percent  for  signal-to-noise  ratios  of  39, 
20,  and  0 decibels  respectively. 


ESTIMATING  THE  PARAMETERS  OF  A NOISY  ALL-POLE  PROCESS 
USING  POLE-ZERO  MODELING 


W.  J.  DONE 
C.  K.  RUSHFORTH 

Abstract 

Linear  predictive  coding  (LPC)  has  been  successfully 
applied  to  the  encoding  of  speech  and  other  time  series.  It 
has  been  widely  observed,  however,  that  the  performance  of 
an  LPC  algorithm  deteriorates  rapidly  in  the  presence  of 
background  noise.  In  this  paper,  we  describe  and  discuss 
one  approach  to  the  identification  of  a time  series 
corrupted  by  additive  white  noise. 

A common  approach  to  this  problem  is  to  prefilter  the 
noisy  time  series,  and  then  to  apply  an  estimation  algorithm 
f which  treats  the  time  series  as  if  it  were  noise-free.  We 

describe  an  alternative  approach  which  involves  modifying 


the  time-series 

model 

at  the  outset  to  the  account 

for 

the 

presence  of 

noise 

. An  estimation 

algorithm 

is 

then 

developed  for 

this 

modified  model. 

We  discuss 

the 

development  of 

the  model,  the  estimation 

algorithm. 

and 

some 

) representative  experimental  results. 
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This  abstract  is  taken  from  the  Ph.D.  dissertation  of 


W.J.  Done  cntilted#  "Estimation  of  the  Parameters  of  an 
Autoregressive  Process  in  the  Presence  of  Additive  White 
Noise."  This  diss-'r tation  has  been  published  as  technical 
report  No.  OTEC-CSc-79-021. 


EVALUATION  OF  THE  STEIGLITZ  ALGORITHM  FOR  ESTIMATING 


THE  PARAMETERS  OF  AN  ARMA  PROCESS 


W.  J.  Done 
C.  K.  Rushforth 


Abstract 

Steiglitz  has  recently  described  an  algorithm  for 
estimating  the  parameters  of  an 
autoregressive-moving-average  (ARMA)  process.  This 
algorithm  has  application,  for  example,  to  the  problem  of 
determining  the  poles  and  zeros  of  the  vocal-tract  transfer 
function. 

In  this  paper,  we  report  and  discuss  the  results  of  a 
number  of  simulations  conducted  using  the  Steiglitz 
algorithm.  The  bulk  of  the  experiments  involved  driving  the 
ten-pole,  two-zero  filter  described  in  (2)  with  a single 
pulse,  with  a short  pulse  train,  and  with  samples  of  white 
Gaussian  noise.  In  each  of  these  cases,  we  evaluated  the 
effects  of  such  processing  options  as  windowing, 
pre^mphasis,  and  cepstral-domain  filtering.  We  also  discuss 
and  compare  results  obtained  by  applying  the  Steiglitz 
algorithm  and  a Newton-Raphson  conditional 
maximum-likelihood  algorithm  to  a first-order  process. 


This  abstract  is  taken  from  the  Ph.D.  dissertation  of 


W.J.  Done  entitled r "Estimation  of  the  Parameters  of  an 
Autoregressive  Process  in  the  Presence  of  Additive  White 
Noise."  This  dissertation  has  been  published  as  technical 
report.  No.  UTEC-CSc-79-021. 
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RATE/PITCH  MODIFICATION 
USING  THE  CONSTANT-Q  TRANSFORM 

1 

James  E.  Youngberg 

Abstract 

Modification  of  the  rate  of  occurrence  of  acoustic 
events  without  altering  frequency  content,  and  modification 
of  pitch  without  changing  time  scale  are  presented  as 

i 

equivalent  problems.  While  the  short>time  Fourier  transform 
has  been  used  to  solve  the  rate  modification  problem,  it  is 
not  a natural  tool.  It  lacks  the  scaling  property  of  the 
Fourier  transform.  The  Constant-Q  transform,  on  the  other 
hand,  exhibits  this  properly.  A more  natural  rate/pitch 

modification  system  using  the  Constant-Q  transform  is  ^ 

1 

! 

presented  which  performs  well  with  rate/pitch  changes  by  \ 

factors  of  between  one-third  and  three. 
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A SPECTRAL  SUBTRACTION  ALGORITHM  FOR 
SUPPRESSION  OF  ACOUSTIC  NOISE  IN  SPEECH 


Steven  F.  Boll 


To  be  presented  at  ICASSP-79 
April  2-4,  1979 
Washington,  D.C. 


Spectral  subtraction  has  been  shown  to  be  an 
effective  approach  for  reducing  ambient  acoustic  noise 
in  order  to  improve  the  intelligibility  and  quality  of 
digitally  compressed  speech.  This  paper  presents  a set 
of  implementation  specifications  to  improve  algorithm 
performance  and  minimize  algorithm  computation  and 
memory  requirements.  It  is  shown  spectral  subtraction 
can  be  implemented  in  terms  of  a nonstationary » 
multiplicative,  frequency  domain  filter  which  changes 
with  the  time  varying  spectral  characteristics  of  the 
speech.  Using  this  filter  a speech  activity  detector 
is  defined  and  used  to  allow  the  algorithm  to  adapt 
automatically  to  changing  ambient  noise  environments. 
Also  the  bandwidth  information  of  this  filter  is  used 
to  further  reduce  the  residual  narrowband  noise 
components  which  remain  after  spectral  subtraction. 
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Introduction 


I 


I 


Digital  speech  compression  systems  operating  in 
environments  with  high  ambient  acoustic  noise  may 
require  additional  noise  suppression  to  process  speech 
having  acceptable  intelligibility  and  quality  11]. 
Previous  results  to  suppress  noise  using  the  spectral 
subtraction  approach  have  demonstrated  quantitative 
improvements  in  quality  and  intelligibility  [2],  [3]. 
This  paper  describes  a number  of  techniques  for 
improving  the  efficiency  and  effectiveness  of  this 
approach.  It  is  shown  that  the  algorithm  can  be 
implemented  in  terms  of  a nonstationary, 
multiplicative,  frequency  domain  filter. 
Characteristics  of  this  filter  provide  information  for 
further  reduction  of  spectral  error  and  detection  of 
speech  activity.  In  addition  techniques  are  presented 
for  increasing  algorithm  efficiency,  decreasing  memory 
requirements,  decreasing  processing  delay,  and 
simplifying  requirements  for  interfacing  the  noise 
suppressor  with  the  subsequent  speech  compression 
analyzer . 


Signal  Estimation  Using  Spectral  Subtraction 

Signal  x(i)  digitized  from  a single  microphone 
consists  of  the  sum  of  speech  Sp(i)  and  ambient 
acoustic  noise  n(i).  It  is  assumed  that  the  noise  is 


i 


] 
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i 
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locally  stationary  to  the  extent  that  average  value  of 


its  spectral  magnitude  during  speech  activity  is  equal 
to  that  measured  just  prior  to  speech  activity.  Using 
these  assumptions  the  spectral  subtraction  algorithm 
attempts  to  suppress  the  additive  acoustic  noise 
component  n(i)  from  x(i)  by  the  following  steps: 


1.  Segment  the  noisy  data  into  windowed  analysis 
blocks  of  length  M samples,  x (i) , i=0, 1. . . ,M-1. 

2.  Compute  the  N point  DFT  X(k)  of  data  x(i). 

3.  Estimate  the  speech  spectrum  S(k)  by 

subtracting  the  average  noise  spectral  magnitude,  B(k) 
■ ave|N(k)|,  calculated  during  non-speech  activity, 

from  I X(k) I : 

S(k)  - ( |X(k) l-B(k) 1 exp(j  ARGlX(k)l)  k-0,1, . . . ,N-1 

The  motivation  behind  this  approach  is  to  subtract 
from  the  noisy  speech  spectrum,  an  estimate  of  the 
noise  spectrum  which  is  readily  available.  The 
I magnitude  of  N(k)  is  replaced  by  its  average  value, 

B(k),  and  the  phase  of  N(k)  is  replaced  by  the  phase  of 

j X(k). 

I The  spectral  error  using  this  approach  is  given  by 

I S(k)-Sp(k)  - N(k)-B(k)  exp(j  ARG[X(k)]) 
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A simple  method  for  reducing  this  error  is 
half-wave  rectification.  With  it  the  estimator  becomes 


S(k)  » {X(k)-B(k)  }exp(j  ARGtX(k)))  |X(k)|>B(k) 

0 |X(k)|<B(k) 

Multiplicative  Filter 

The  spectral  subtraction  estimator  can  be 
compactly  defined  using  a multiplicative  frequency 
filter , H (k) : 

H(k)  = (l-B(k)/|X(k) I ) (1/2  + 1/2  SGN( |X(k) l-B(k) ) ) 

The  speech  estimate  S(k)  is  then  given  by 
S (k) =H (k) X(k) . Examination  of  the  expression  for  H(k) 
shows  that  H(k)  = 0 when  |X(k)|<B(k),  (band  stop)  and 
H(k)~l  for  I X(k) I >>B(k) , (band  pass).  In  addition  an 
estimate  of  the  signal  to  noise  ratio  SNR  is  directly 
available  from  H(k)  at  each  frequency  bin  k; 

SNR(k)  = S(k)/B(k)  = H(k)/(1-H(k)  ) 


Residual  Noise  Suppression 

After  half-wave  rectification  speech  plus  noise 
above  B(k)  remains.  In  the  absence  of  speech  activity, 
the  noise  residual  N(k)-B(k)  exp  (j  ARG[n(k)])  will 
exhibit  itself  as  randomly  spaced  narrowband  spikes 
separated  by  intervals,  having  zero  magnitude.  The 
corresponding  frequency  filters  H(k)  will  have  the  same 
zero  magnitude  intervals.  Non-zero  amplitudes  will 
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have  values  given  by 

H(k)  - l-B(k)/|N(k)  I 

These  values,  being  deviations  of  the  noise 
magnitude  spectrum  above  its  mean  correspond  to  the 
noise  residual.  Assuming  the  noise  to  be  a zero  mean, 
Gaussian  process,  the  magnitude  spectrum  of  |N|  will 
have  a Rayleigh  distribution.  Using  this  information 
it  can  be  shown  that  less  than  1%  of  the  time  will  H(k) 
exceed  a value  of  0.6  (2.5  times  its  mean,  B(k))  when 
speech  is  absent.  This  suggests  that  the  noise 
residual  could  be  eliminated  99%  of  the  time  by  simply 
zeroing  all  spectral  components  which  corresponds  to 
values  of  H(k)  less  than  0.6.  However,  during  speech 
activity,  assuming  Gaussian  speech  and  a signal  to 
noise  ratio  of  lOdB,  H(k)  will  take  on  values  below 
0.6,  about  36%  of  the  time.  Thus  simply  rejecting  all 
spectra  X(k)  corresponding  to  H(k)  below  0.6  could  in 
some  instances  incorrectly  remove  low  energy  speech 
spectra. 

In  order  to  reduce  the  noise  residual  but  retain 
low  energy  speech  in  X(k),  a magnitude  plus  bandwidth 
measurement  test  is  used.  Sections  of  H(k)  having 
bandwidths  less  than  300Hz  and  amplitudes  less  than  0.6 
are  classified  as  being  due  only  to  noise.  Here 
bandwidth  is  defined  as  the  distance  between  successive 
frequency  bins  having  zero  amplitude.  The  300Hz  figure 
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was  empirically  determined  after  examining  an  ensemble 
of  subtractive  filter  frequency  responses  taken  during 


non-speech  activity  using  helicopter 
noise  only  sections  are  attenuated  by 
20dB. 


noise.  These 
an  additional 


This  secondary  noise  suppression  procedure  was 
applied  to  all  values  of  H(k)  above  800Hz.  Below  800Hz 
narrowband  harmonics  essential  to  accurate  pitch 
detection  can  be  present.  This  procedure  could 
incorrectly  attenuate  them  causing  pitch  tracking 
errors.  Therefore  in  this  frequency  region  only  bias 
removal  and  half-wave  rectification  is  employed.  The 
800Hz  value  was  picked  to  equal  the  cutoff  frequency  of 
the  low-pass  filter  applied  to  the  signal  prior  to  down 
sampling  for  SIFT  [4]  pitch  detection.  Figure  1 shows 
examples  of  subtractive  filters  and  corresponding 
magnitude  spectra  before  and  after  residual  noise 
reduction  computed  for  a frame  of  noise  only  signal. 
Figure  2 shows  examples  of  subtractive  filters  and 
corresponding  magnitude  spectra  before  and  after 
residual  noise  reduction  during  voiced  speech. 


1 
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Algorithm  Implementation 

The  task  of  spectral  subtraction  is  to  provide  the 
vocoder  analyzer  with  a buffer  of  noise  suppressed 
speech  in  a time  interval  which  is  not  only  less  than 
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the  buffer  length  time  but  which  is  also  short  enough 
to  allow  the  analyzer  to  compute  and  transmit  the 


I 


I 

I 

I 

i 

I 

I 

■ 

i 

I 


vocoder  channel  parameters.  This  interfacing 
constraint  imposes  certain  conditions  on  the 
implementation.  The  algorithm  should  use  the  same 
buffer  size  as  the  analyzer.  Assuming  a single 
processor  it  must  compute  the  noise  suppressed  speech 
in  the  time  left  over  after  the  analyzer  calculations. 
It  must  supply  the  processed  speech  with  minimum  delay. 
In  addition  to  the  basic  noise  suppression  procedures, 
it  must  monitor  the  signal  to  noise  environment  and 
update  the  average  noise  bias  spectrum  B(k)  if 
necessary. 

Data  Segmentation 

Buffer  lengths  of  speech  compression  analyzers 
come  in  all  sizes.  Matching  the  noise  suppression 
analysis  buffer  to  that  used  by  the  vocoder  results  in 
the  simplest  implementation.  This  approach,  however, 
leads  to  two  operational  compromises.  First,  if  the 
buffer  is  not  a power  of  two  then  zeros  must  be 
appended  before  transforming.  Second,  if  buffer 
lengths  are  to  be  matched,  with  minimum  delay,  then  no 
overlapping  (and  thus  no  windowing)  is  allowed.  The 
effect  of  padding  with  zeros  simply  means  lower 
efficiency  (fewer  points  processed  per  FFT) . It  has  a 
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positive  effect  of  reducing  the  amount  of  temporal 
aliasing  due  to  spectral  modification  [5].  No  overlap 
of  time  windows  doubles  the  processing  speed.  The 
possible  detrimental  effect  of  having  no  time  window 
consists  of  inducing  discontinuities  at  the  buffer 
boundaries.  Reconstituted  waveforms  from  successive 
analysis  buffers  will  not  necessarily  agree  at  the 
boundary.  In  fact,  in  listening  to  the  processed 


speech  entering 

the  vocoder. 

a low 

-level 

but  distinct 

clicking  sound 

can 

sometimes 

be 

heard 

having 

a 

frequency  equal 

to 

the  analysis 

frame 

rate . 

The 

clicking  is  due 

to 

waveform 

discontinuities  at 

the 

boundaries.  If 

the  data 

had 

been 

weighted 

by 

half -over  lapped  banning  windows,  the  discontinuity 
effect  could  be  minimized.  However,  since  the  speech 
is  to  be  further  processed  by  a compression  analyzer 
using  the  same  buffer  size,  the  discontinuities  do  not 
cause  noticeable  problems. 


Bidirectional  Biplexed  DPT 

Spectral  subtraction  requires  two  DPT's  to  be 
performed:  a forward  transform  of  the  noisy  signal 
x(i)  and  an  inverse  transform  of  the  noise  suppressed 
spectrum,  S (k) «X(k) H (k) . Armantrout  [6]  developed  a 
biplexed  DPT  which  simultaneously  computes  the  forward 
transform  of  x(i)  and  the  inverse  transform  of  S(k) 
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from  the  previous  frame. 


The  loading  procedure  is 


given  as 

RE(i)  « xo(i)  + SR(i)/N 
Im(i)  * xe(i)  - SI(i)/N 

where  xe(i)  » (x(i)  + x(N-i))/2,  even  part  of  «(i) 
xo{i)  = (x(i)  - x(N-i))/2,  odd  part  of  x(i) 
SR(i)  * Real  part  of  S(i) 

SI(i)  = Imaginary  part  of  S(i) 

N = DPT  size 

Let  C(k)  + jD(k)  = DPT  {RE(i)+jIM(i)  } 

Then 

s(k)  = C(k) 

Re{X(k)}  = (D(k)  + D(N-k))/2,  even  part  of  D(k) 
Im{X(k)}  = (D(k)  + D(N-k))/2,  odd  part  of  D(k) 
where 

s{k)  equals  the  inverse  DPT  of  S(k) 

Re{X(k)}  » Real  Part  of  X(k) 

Im{X(k)}  = Imaginary  part  of  X(k) 

In  addition,  the  even-odd  symmetries  of  the 
signals  can  be  used  to  reduce  the  storage  requirement 
in  half.  That  is,  the  even  part  of  the  signal  can  be 
stored  in  the  first  N/2+1  locations  and  the  odd  part  of 
the  signal  in  the  last  N/2-1  locations. 


Speech  Activity  Detection 


Effective  noise  suppression  requires  an  accurate 
estimate  of  the  averaqe  noise  bias,  B(k).  If  the 
ambient  noise  becomes  either  louder  or  softer,  the  bias 
should  be  updated  during  the  next  interval  of 
non -speech  activity. 

For  detecting  the  absence  of  speech  activity 
during  a stationary  noise  interval  and/or  detecting  a 
decrease  in  the  noise  bias,  the  estimated  signal  to 
noise  ratio: 

SNP  (k)=H(k)/{l-H(k)  )=S(k)/B(k) 
can  be  used.  Computing  the  average  SNR(k)  over  all 
frequency  bins  provides  a measure  the  relative  energy 
of  S to  B.  During  the  absence  of  speech  activity,  the 
SNR  was  found  to  be  less  than  -12dB  over  a wide  range 
of  noise  environments.  This  measure  also  can  detect 
when  the  ambient  noise  becomes  less.  In  this  instance 
more  values  of  X(k)  will  lie  below  B(k)and  thus  more 
values  of  H(k)  will  be  zero  driving  the  average  value 
down.  Thus  the  measure  H/(l-H)  averaged  over  all 
frequency  bins  compared  with  the  threshold  -12dB  was 
used  to  signal  speech  absence  and/or  noise  bias 
reduction . 


Noise  Bias  Increase  Detection 
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Detecting  when  the  average  noise  bias  has  become 
louder  presents  a more  difficult  problem  since  spectra 
above  the  noise  mean  is  assumed  to  be  speech.  As  the 
noise  increases  a larger  percentage  of  X(k)  lies  above 
B(k).  Thus  if  N(k)>>B(k)  then  H(k)“l.  This 
unfortunately  is  the  identical  situation  found  during  a 
high  signal  to  noise  ratio  environment.  The  measure 
that  is  needed  is  N(k)/B(k)  or  equivalently  X(k)/B(k) 
for  Sp(k)>0.  A procedure  used  to  obtain  N(k)/B(k)  was 
to  average  X(k)/B(k)  = 1/1-H  (k) )over  the  top  300Hz  of 
the  base  band.  If  this  average  was  greater  than  lOdB 
for  ten  consecutive  analysis  frames  then  the  noise  bias 
is  updated. 

Automatic  Operation 

Using  the  speech  activity  and  bias  increase 
detectors  the  spectral  subtraction  algorithm  will  run 
without  operator  intervention.  The  detectors  provide 
one  of  many  possible  schemes  for  adaptive  operation  in 
a changing  noise  environment.  Others  are  possible  and 
proper  procedures  for  correct  adaption  still  remain  a 
research  issue  probably  best  resolved  using  a real-time 
system  employed  in  actual  operating  environments. 

A block  diagram  showing  the  various  algorithm 
procedures  is  given  in  Figure  3. 
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Discussion 
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Omitting  the  windowing  and  half  overlapping 
simplifies  the  interface  requirements  with  the  follow 
on  vocoder  and  doubles  the  throughput  per  transform  of 
the  algorithm.  This  approach  induces  discontinuities 
at  the  boundaries  which  are  essentially  ignored  by  the 
speech  analyzer.  Using  the  bidirectional,  biplexed  DPT 
produces  only  one  frame  of  delay  and  takes  advantage  of 
the  symmetries  of  the  real  data  to  reduce  FFT 
computation  by  about  one-half.  Reduction  of  the 
residual  noise  left  after  subtraction  using  the 
amplitude-bandwidth  test  removes  the  majority  of  the 
noise  residual  while  retaining  wide  bandwidth,  low 
energy  speech.  However,  noise  spectral  components 
which  exceed  2.5  times  its  mean  or  with  bandwidths 
greater  than  300Hz  will  remain.  These  components,  due 
both  to  statistical  randomness  and  nonstationary, 
remain  due  to  their  resemblance  to  speech  spectra. 
Thus  the  algorithm  is  biased  towards  keeping  low  energy 
speech  and  high  energy  noise. 

A final  modification  to  the  multiplicative  filter 
to  suppress  the  acotfstic  effect  of  the  remaining  noise 
is  to  replace  the  zero  amplitude  frequency  bins  in  H 
with  a small  constant.  Using  0.1  instead  of  0.0  brings 
the  noise  floor  up,  insures  that  the  magnitude  spectrum 
is  now  everywhere  positive,  and  reinstates  the  natural 
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ambient  noise  environment  only  now  attenuated  by 
approximately  20dB. 


i 


It  should  be  apparent  that  as  the  signal  and  noise 
energies  become  equal  or  the  noise  becomes  highly 
nonstationary  this  algorithm  will  break  down.  Speech 
intelligibility  in  these  situations  can  be  improved 
using  noise  suppression  microphones  [1]  and/or  two 
microphone  adaptive  noise  cancellation  procedures  [7]. 
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Figure  2a:  Voiced  Speech,  Subtractive  Filters. 


Figure  2b:  Voiced  Speech,  Magnitude  Spectra. 


Multiply 

Spectrum 

S=HX 


Segment  Speech 


x(i) 


Even/Odd  Split 


xe (i)  xo  (i) 

! ; 1 

Bidirectional 
Biplexed  FFT 


s(i)  Output  Speech 


Figure  3 

System  Block  Diagram 


-27- 


REDUCTION  OF  NONSTATIONARY  ACOUSTIC  NOISE  IN 


SPEECH  USING  U<S  ADAPTIVE  NOISE  CANCELLING 


Craig  Rushforth 
E.E.  Dept. 
University  of  UtcHi 
Salt  Lake  City,  UT 


Steven  F.  Boll 
Con^uter  Science  Dpt 
University  of  Ut£Ui 
Salt  Lake  City,  UT 


LaMar  Timothy 
E.E.  Dept. 
Unversity  of  Utah 
Salt  Lake  City,  UT 


December  1978 


TO  be  presented  at  ICASSP-79 
April  2-4,  1979 
Washington,  D.C. 


REDUCTION  OF  NONSTATIONARY  ACOUSTIC  NOISE  IN 
SPEECH  USING  LMS  ADAPTIVE  NOISE  CANCELLING 


Dennis  Pulsipher  Steven  F.  Boll  Craig  Rushforth  LaMar  Timothy 


Sandia  Laboratory  University  of  University  of 

Utah  Utah 


Nonstationary  acoustic  noise  with, 
energy  possibly  equal  to  or  greater  than 
the  speech  it.  suppressed  using  a two 
■icrophone  ieplesientation  of  adaptive 
noise  cancellation.  The  primary  noise 
added  to  the  speech  is  reduced  by 
subtracting  a filtered  version  of  the 
second  microphone  reference  noise.  The 
reference  noise  filter  is  adaptively  up 
dated  using  the  Widrow-Hoff  LNS  algorithm 

(1].  The  effectiveness  of  noise, 
suppression  depends  directly  on  the 
ability  of  the  filter  to  estimate  the, 
transfer  function  relating  the  primary  andi 
reference  noise  channels.  A study  of  the 
filter  length  required  to  achieve  a 
desired  noise  reduction  level  in  a 
hard-walled  room  is  presented.  Results- 
demonstrating  noise  reduction  in  excess 
ItdB  in  an  environment  with  BdB  signal 
noise  ratio  are  presented. 


Introduction 


Let  us  assume  that  we  are  given  x(t), 
the  sun  of  two  mutually  uncorrelated 
signals,  s(t)  and  n(t),  and  a third  signal 
v(t),  which  is  mutually  uncorrelated  with 
s(t).  We  can  then  form  a signal  estimate 


(1)  »(t)  ■ x(t)-u(t)-s(t)-t(n(t)-u(t)  ] 


where  u(t)  is  a noise  estimate  which  we 
will  constrain  to  be  a linearly  filtered 
version  of  v(t),  (see  Figure  1). 
Minimising  the  mean  output  power  causes 
the  signal  estimate  s(t)  to  be  a least 
mean  squares  fit  to  the  signal  s(t).  The 
minimization,  of  course,  must  be  carried 
out  by  choosing  an  h(t)  (the  impulse 
response  of  the  filter  through  which  v(t) 
is  passed  to  generate  u(t))  which 
minimises  the  power  in  i(t).  We,  then, 
ate  looking  for  h(t)  which  satisfies! 


Min(B(i(t)2)j 

h(t) 


Block  Solution 


Let  V , s , s , etc.  be  the  value  of 
the  corresponding  signal  at  time  nT,  where 
T is  the  sampling  interval. 

Define  the  vectors 
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(2)V^-Iv(n)  ...v(n-L-H)  11^- (h  (1  ,n) . . .h  (L,n)  ] 


where  L is  the  length  of  the  filter  to  be 
estimated  and  H is  the  filter. 

Defining 


(3)  P - E(x^V„)  R - E{V„V„  ) 
n n n n 


yielding 

(4)  E(s"^)  - E{x^  ) - 2p’'h  + bFrh 


which  is  a quadratic  function  of  H.  By 
differentiating  with  respect  to  the 
elements  of  H we  get 


V - - 2P  2RH. 


Setting  V*0  to  find  the  optimal  H , we  get 
(6)  H*  - R"^P. 


The  block  solution  optimal  filter  was 
calculated  by  solving  equation  6.  The 
filters  were  calculated  using  a standard 
Levinson's  recursion  algorithm  [2]. 


Adaptive  Solution 


To  calculate  H adaptively  a standard' 
steepest  descent  algorithm  is  used: 


Vl  - «n  - ^’n 


where  the  parameter  p controls  convergence 
and  stability.  Since,  we  do  not  have 
access  to  V , we  use  a gradient  estimate. 


(8)  • — 2S|iVji 


which  yields  the  algorithm 


“n+l  ■ Hn  ♦ 2**«nVn- 


By  defining  the  expected  value  of  H 


as  M„  it  can  be  shown  that 


(IB)  NkII  -2pRf^-^R"^P-(I  -^R)"r"^. 


By  diagonaliting  R,  it  can  be  shown  that 

(11)  llm(Mj  - bT^P  for  B<n<  i 

n*-"  ^ m-x 


where  is  the  largest  eigenvalue  of 


n 


29 


1 


» 


i 
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the  BatrlK  R.  The  variance  of  the 
eatiaate  can  alao  be  forced  below  any 
arbitrary  positive  Halt  as  n gets  lar9e 
for  uncorrelated  with  Vj  for  k^j. 

The  optiaal  filter  is  a function  of 
the  inverse  of  R,  R~lp  (eq.  6).  If  R is 
sinqular  it  does  not  aean,  in  general » 
that  there  is  no  solution,  siaply  that  it 
is  not  unique.  This  condition  is 
frequently  encountered  when  the 
interfering  noise  is  periodic,  or  nearly 
periodic.  While  channel  estiaation  is  not 
coapletely  possible  in  such  cases,  it  is 
only  necessary  to  estimate  the  channel 
accurately  in  those  frequency  bands  where 
significant  Interfering  energy  is  present. 
Even  though  the  channel  estlaate  aay  be 
considered  poor  in  such  a situation,  the 
noise  reduction  achievable  aay  be 
significant. 

Data  Generation 

If  the  data  is  generated  as  shown  in 
figure  2,  and  if  the  channel  is  a finite 
length  all-sero  filter,  perfect  noise 
cancellation  can  be  achieved  if  the 
estiaated  linear  filter  , H , converges 
to  G . A aore  realistic  aodel  for  data 
generation  is  given  in  Figure  3.  The 
noise  cancellation  problea  ia  then  reduced 
to  estlaating  of  Gi-iG|  , see  Figure  4.  If 
C,  and  G,  can  be  aodel led  as  all-  tero 
filters,  the  difficulty  in  estlaating  the 
optiaal  filter  arises  because  of  the  need 
to  effectively  invert  Gi . In  general  Gj 
will  not  be  a 'ainiaua  phase*  process. 
Its  inverse  will,  therefore,  have  poles 
'outside  the  unit  circle.  For  the 
estiaated  optiaal  filter  to  be  stable  will 
require  it  to  be  noncausal  and  doubly 
infinite.  If  its  poles  are  well  away  froa 
the  unit  circle,  the  response  will  be 
doainated  by  rapidly  decaying 
exponentials.  This  allows  us  to 
approxiaate  the  required  doubly  infinite 
recursively  generated  filter,  with  a 
finite  transversal  filter.  As  the  teroes 
of  G,  and  the  actual  poles  of  g7' approach 
the  unit  circle,  however,  the  number  of 
points  which  we  must  allow  in  the  active 
interval  of  the  filter  to  be  estimated 
grows  if  we  desire  to  maintain  a constant 
error,  (31. 

Basic  Experiments 

A white  noise  generator  was  used  as  ^ 
primary  noise  source.  Its  output  wac 
low-pass  filtered  to  3.2  KHx  and  sampled 
at  a rate  of  6.67  KHx.  A square  wave 
generator  was  used  to  generate  neatly 
periodic  noise  saaple.  This  saaple  was 
aade  highly  non-stationary  by  varying  the 
frequency  adjustment  of  the  square-wave 
generator  in  a seai-randoa  fashion  while 
digitising.  These  samples  were  then 
concatenated  and  used  as  noise  sources  for 


both  synthetic  and  acoustically  recorded 
experiments. 

Four  FIR  channel  filters  were  used  in 
order  to  analyze  the  performance  of  the 
noise  cancellation.  A low-pass  filter 
with  its  cutoff  frequency  at  approximately 
1560  Hz  and  a triple  band-pass  filter  were 
created.  Two  room-channel  estimates  were 
made  from  actual  measurements  of  a room's 
tesponse  in  order  to  simulate,  digitally, 
an  actual  room  [4]. 

Synthetic  Experiments 

For  the  initial  experiments  digitally 
recorded  speech  signals  were  added  to  the 
channel  filtered  noise  segments  and  used 
as  the  noisy  signals  applied  to  the  AMC 
algorithm.  The  corresponding  unfiltered 
noise  segments  were  used  as  the  noise 
reference  input  signals.  When  white 
Gaussian  noise  was  used  as  the  interfering 
noise,  accurate  channel  estimates  were 
obtained, (H  converged  to  G2 ) for  both- 
low-pass  and  multi-band-pass  channels. 
When  the  highly  correlated,  nearly 
periodic  noise  samples  were  used,  the 
channel  estimates  did  not  converge  to  the 
known  channels,  but  essentially  complete 
noise  cancellation  still  occurred. 

Room  Simulations 

Using  the  measured  room  impulse 
responses,  the  degree  of  cancellation 
possible  in  a hard-walled  room  about 
fifteen  feet  square  was  determined.  In 
the  first  experiment,  the  original  noise 
signal  was  used  as  the  reference 
noise(G.Bl),  and  one  of  the  room  channel 
filtered  signals  was  used  as  the  noisy 
signal.  While  the  original  room  channel 
estimates  were  4096  points  long,  the 
adaptive  filter  was  constrained  to  a 
length  of  3000  points.  An  adaption 
time-constant  of  approximately  0.4  seconds 
was  specified.  Noise  reduction  of  -25  dB 
was  measured  for  this  experiment. 

In  the  second  experiment,  the 
reference  noise  was  generated  by 
convolving  the  white  noise  with  one  of  the 
room  transfer  functions, Q]  , while  the 
noise  added  to  the  speech  was  generated  by 
convolving  the  white  noise  through  the 
other  room  transfer  function, G,  . This 
data  model  corresponds  to  Figure  3.  Again 
3000  points  were  specified  for  the 
adaptive  filter's  length,  half  of  them 
before  t«0.  The  resulting  noise  reduction 
measured  was  -12  dB. 

Acoustically  Recorded  Experiments 

Two  similar  experiaents  were 
performed  in  an  actual  acoustic 
environment.  The  digitised  noise  sources 
were  played  through  a single  aulti-eleaent 
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BOSE  loudspeaker  and  digitally  recorded 
through  two  separate  SONY  ECN-27B 
aiicrophones  placed  at  the  same  locations 
in  the  hard-walled  room. 

First  a single  channel  room 
estimation  experiment  was  performed  . The 
reference  noise  was  picked  up  by  directly 
digitising  the  speaker  signal (G,"!) . The 
acoustically  added  speech  plus  noise 
signal  was  simultaneously  recorded.  The 
noise  reduction  achieved  in  this 
experiment  was  -24  dB. 

Many  experiments  were  performed  where 
both  signals  were  acoustically  recorded. 
Using  this  data  the  optimal  filter  was 
estimated  using  both  block  and  adaptive 
analysis.  The  results  of  these 
experiments  for  various  filter  lengths  are 
compared  in  Table  1. 

Observations  on  Results 

Examination  of  the  adaptive  filter's 
impulse  responses  for  the  synthetic 
experiments  showed  that  their  estimates  of 
the  channels  were  excellent  in  frequency 
bands  where  significant  noise  energy  was 
present,  and  very  poor  where  no  noise  wss 
present.  This  was  not  unexpected,  since 
the  adaptive  filter's  impulse  response  is 
a linear  combination  of  previous  reference 
noise  sample  vectors.  For  periodic  noise 
the  optimal  filter  was  not  unique  , 
however  the  noise  reduction  wss  as  good  ss 
thst  achieved  when  a white  noise  source 
was  employed. 

A comparison  of  the  results  obtained 
from  synthetic  and  actual  rooms  (-25  dB 
vs.  -24  dB  in  the  single  channel  case, 
and  -12  dB  vs.  -10.5  dB  for  the  two 
channel  case)  indicates  that  the 
assumption  thst  a room  can  be  modelled  as 
a linear,  stationary  channel  appears 
valid.  The  results  also  show  that 
spatially  distributed  noise  sources,  such 
as  multi-element  loudspeakers,  do  not 
cause  of  great  deal  degradation  in 
performance.  The  comparisons  of  filter 
length  verses  noise  reduction  show 
relative  performance  losses  caused  by 
filter  truncation.  They  are  applicable  to 
a single,  hard-walled  room  about  fifteen 
feet  square,  and  ought  not  to  be 
considered  universally  attainable  levels. 
The  absolute  noise  reduction  obtainable 
I for  a given  filter  length  is  extremely 
“dependent  upon  the  physical  environment 
where  the  process  is  being  employed. 

The  comparisons  of  the  ANC  approach 
and  the  global  block  analysis  showed  that 
the  adaptive  procedure  consistently 
performed  better  due  to  the 
nonststionarity  of  the  noise.  The  block 
analysis  was  not  developed  for 
nonststionary  data  and  attempted  to 


minimise  the  total  output  energy.  Also 
Levinson's  recursion  blew  up  when  trying' 
to  compute  a 3000  point  filter. 
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Figure  1 NOISE  CANCELLING  MODEL 
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Figure  2 BASIC  DATA  GENERATION  MODEL 
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Figure  3 REALISTIC  DATA  GENERATION  MODEL 
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ABSTRACT 

This  paper  describes  a theoretical  and  axparimental 
investigation  for  detecting  the  presence  of  speech  in  wideband 
noise.  A robust  algorithm  for  makirtg  the  silenca-vovad- 
unvoiced  decision  is  described.  This  algorithm  is  based  cn  a 
nonparametric  rank-order  statistical  signal-datect.jn  scheme 
that  does  not  require  a trainirtg  sat  of  data  and  maintains  a 
constant  falsa-alarm  rate  for  a broad  class  of  noise  inputs 
corresponding  to  a single  decision  threshold.  The 
nonparametric  rank-order  dacisian  procedure  is  the  multiple 
use  of  the  two-sample  Savage  T sutistw.  The  parfortTtanca  of 
this  detector  is  avaluatad  and  compared  to  that  obtained  by 
manually  classifying  iwanty  recorded  utterances  with  39.  30. 
20.  10.  and  0 decibel  signal-to-noisa  ratxM.  In  limited  tasting, 
the  average  probability  of  misclassification  is  less  than  S 
percent.  12  percent,  and  55  percent  for  signal-to-noae  rstioe  of 
39.  20.  and  0 dectiola  raspsctivoly 


INTRODUCTION 

The  furrdamantal  problem  m many  speech  communication 
and  understand irrg  svatanrs  is  the  search  for  a dscisian 
procedure  that  will  classify  speech  in  a noisy  environment  as 
voiced,  unvoiced,  or  silerrce  (rroiaa  alone).  For  several  years, 
the  rtotabla  advancss  in  narrowband  vocoders  have  motivstod 
invastigatKtn  into  the  theoretical  aspects  of  robust  speech 
classification  algorithms  that  will  effectivolv  operate  in  adverse 
noise  environments. 

A number  of  papers  arid  reports  have  boon  publishad 
dascribirig  the  theory  arid  tschniquss  for  makirig  the  voced- 
urwoiced-silaiics  (V-UV-S)  classification  (1);  tiowover.  vary  few 
papers  have  dealt  with  the  problem  of  devalopirig  effective 
algorithms  for  real  noise  environments.  In  most  of  these 
papers,  the  detection  of  speech  in  background  noisa  was 
coriductad  in  a rsaltively  noise-free  environment  under  ideal 
laboratory  acoustic  recordirig  coriditions.  The  decision 
procedure  that  has  enjoyed  the  wxlast  acceptance  is  the  pattern 
rscognitxin  approach  of  Atal  and  Rabirter  {2).  This  technique 
has  been  modHiod  by  various  investigations  [3].  The  paltom 
recognition  approach  to  the  V-UV-S  classification  has 
usefulness  for  marry  speech  processing  system  applications. 
However,  it  does  not  address  the  robustrrass  issue  in  a 
communicstion  sense  sxtca  the  technique  requires  a training 
set  of  data  arid  will  operate  without  degradation  in  performanca 
only  for  a particular  communication  chaniral 

McAulay  (4|  has  suggested  an  algorithm  for  detecting 
speech  in  an  airbonia  comirrand  post  noise  environment,  but  it 

•Rartially  supported  by  OARRA.  Information  i'*racessing  Branch, 
convact  N001 73-77 -C-0041  and  monitorsd  by  NRL. 


requires  a large  amount  of  signal  processirig.  a spsech-free 
interval  to  determine  noise  detection  thresholds,  and  has  not  as 
yet  been  extensively  tested. 

The  V-UV-S  decision  is  a difficult  problem  in  real  noise 
environments:  there  is  a need  for  continued  research  on  the 
theory,  technxiues.  artd  devices  in  this  area  [5]. 

In  the  research  described  here,  a rtonparamatric  rank-order 
statistical  decision  procedure  that  is  theoreticaify  recognized  as 
robust  in  a communication  sensa  has  baen  formulated  and 
investigated  with  a manually  classified  speech  data  base.  It  is 
theoretically  robust  in  the  communicalion  eanse  sineo  it  has  the 
desirable  property  of  maintaining  a constant  falsa-alarm  rate 
(CFAR)  for  a wxle  variety  of  noise  distributions.  The  decision 
threshold  is  set  independent  of  signal-to-noise  ratio  (6). 

The  detection  performance  for  the  Savage  two-sample 
nonparametric  rank-order  test  lor  speech  signals  in  wideband 
noiae  • presentad  m this  paper  A simple  version  of  the 
problem  • chosen  m order  to  make  a rigorous  analysis  poasibla. 
to  avaluats  the  applicability  of  nonparametric  procedures  to  V- 
UV-S  classification,  and  to  gam  clarity 

Although  this  dstaction  approach  is  new  to  speech 
processing,  it  a a mature  statistical  discipline.  The 
nonparsmetrc  detection  review  paper  by  Thomas  [7]  indicatas 
that  a bibliography  publiahod  in  1962  gives  more  than  3000 
references  The  application  and  analysa  of  nonparametric 
detections  hatorcally  has  bean  confined  to  nonengineermg 
problems,  an  angmeering  text  has  only  rscently  been  published 
(81 

Some  specific  advantages  of  nonparametric  statistics 
applied  to  speech  detectxm  are:  (1)  It  maintains  a constant 
false-slarm  rats  with  a fixed  threshold  for  large  classes  of  noise 
distributKins  (2)  It  does  not  require  statistical  information 
about  aither  the  signal  or  the  background  noise  (does  not 
require  a traming  set  of  data)  to  sat  a decision  threshold.  (3) 
Performance  for  signals  in  non-Gaussian  noise  may  often 
surpass  that  of  detection  optimized  against  Gaussian  noise.  (4) 
It  will  operate  where  the  noise  statistics  are  nonstationary  or 
change  from  one  application  to  another.  iS)  It  can  be  digitally 
implemented 

DESCRIPTION  OF  THE  ALGORTIHM 
System  Description 

The  system  operates  in  the  following  manner:  The  speech 
signal  is  low-pass  filtered  to  3.2  kHz  (tslephono  bandwidth), 
sampled  at  a 6.67-kHz  rate,  and  high-pass  filtered  at 
approximately  200  Hz  to  remove  any  dc  or  low-frequency  hum. 
The  output  from  the  high-pass  filtar  is  formulated  into  blodts  of 
100  samples  (IS  milliseconds  of  speech  data).  Each  block  of 
speech  is  then  applied  to  four  subband  digital  filters.  The  time 
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MriM  ou«Hit  of  each  lilior  it  labolad.  pooM.  and  rank  ordarad 
Tha  rank-ordar  vakjat  ara  than  pataad  to  tha  daiactor  or 
claaaifior  algorithm.  Figura  1 thowa  a block  diagram  of  tha 
dataction  algorithm. 

Tha  filtar  tubband  partitioning  it  batad  on  tha  work  of 
Crochiana  (9].  Tha  important  proparty  achiavad  by  thia  filtar 
bank  it  that  tha  turn  of  tha  individuai  fraquancy  ratpontat  of 
tha  ban^iaat  filtart  (compoaita  raaponaa)  liaa  flat  vrith  lirtaar 
phaaa.  Tha  daaign  of  tha  aubbanda  it  baaed  on  parcaptual 
criiaria.  Tha  band-partitioning  la  tuch  that  each  tubband 
contributaa'  equally  to  tha  articulation  iitdax  (Al|.  The  Al 
indicaiaa.  on  the  avaraga.  tha  contribution  of  each  part  of  the 
tpactrum  to  tha  overall  perception  of  tha  spoken  sound.  Figura 
2 shoiwa  the  partitioning  of  the  speech  spectrum  into  four 
contiguous  banrla.  Theta  filtart  were  designad  using 
McClellan.  Parks,  and  Rabinar't  program  |10). 

Dataction  Procedure 

To  avahiata  the  applicability  of  nonparamatnc  rank  order 
detactora  to  the  V-UV-S  classification  problem,  three 
aatumptiont  ware  made:  (1 ) Tha  spectrum  of  speech  is 
diffarant  from  bandlimitad  white  noisa.  (2)  Tha  noise  spectrum 
IS  approaimataly  flat.  13)  Tha  amplitude  distribution  of  speech  it 
approximately  Laplacian  (11); 

p(x)=  exp (-72^  I (I) 


Hz  It  ttatiatically  diffarant  from  the  subbanda  forming  the 
pooled  sample,  than  voiced  apaach  it  declared  at  tha  output  of 
the  datactor.  If  tha  oppoaita  condition  axiata.  than  unvoicad 
speech  is  declared  at  tha  output  of  the  datactor.  Tha  decision 
procedure  tasted  is  eloeely  related  to  the  nonperametric 
detection  procedure  uairtg  a spectral  data  concept  first 
introduced  by  Woinaky  (131. 

IWLSailiUS 

The  following  description  of  tha  Savage  teat  statistic 
follows  the  development  presented  m Hatek  (14).  Seice  the 
amplitude  disuibution  of  speech  is  naarly  exponantial.  the 
Savege  test  statistic  is  selsctad  because  it  is  the  optimum  rank 
statistic  for  an  exportantial  distribution  and  a scale  altemativa 
The  Savage  test  statistic  has  tha  form 

N 


where  Zi  is  a switching  function: 

_ . 1 if  the  ith  rank  belongs  to  the  filter  output  under  test 
i ~ 0 otherwise 


and  whera 


N 

A,=  Z 

j=N-l-*-l 


I 

7 


(3) 


where  8 is  tha  rms  speech  vakia. 

Tha  datactor  bssad  upon  those  assumptions  operates  in  the 
following  menner:  The  noise  soactrum  is  assumed  to  be 
approximately  flet  over  the  telephone  band  of  200  to  3200  Hz. 
This  fraquancy  band  is  analyzed  by  forming  four  contiguous 
subbaixls  Tha  subbanda  ara  chosen  so  that  each  subbartd  data 
block  is  inrtapandant  A two-sample  test  statistic  is  used  for 
each  subband  data  block.  The  time-sampled  data  in  tha 
subband  being  tested  forms  the  first  sample,  and  the  remaining 
pooled  data  forms  the  second  sample.  The  procedure  for  tha 
two-sampla  problem  is  to  combina  or  pool  both  samples  into  a 

single  ordered  sample  and  than  assign  ranks  (1.2 N]  to 

tha  sample  values  from  the  smallest  to  the  largest  vslua. 
without  regard  to  the  subbend  source  of  each  value.  The 
simplest  test  statistic  is  the  sum  of  ranks  assignad  to  the  values 
from  one  of  the  subbanda.  If  the  sum  (test  statistic)  is  too  large, 
there  is  some  indication  that  the  values  from  that  subband  tend 
to  be  larger  than  tha  values  of  the  pooled  second  sample  Tha 
null  hypothesis  Hq  of  no  difference  between  subbands  may  be 
rajactad  if  tha  ranks  associated  with  one  sample  tend  to  be 
larger  than  those  of  tha  other  sample:  and  the  alternate 
hypothesis  Ht  is  accepted.  Under  tha  assumption  that  the  rank 
of  any  singis  outcome  is  squally  Ihaly.  tha  probability  of  any 
wet  statistic  can  be  determined  by  counting  outcomes,  knowing 
there  are  Nl  total  permutations.  A tost  statistic  for  aach 
subband  data  btock  is  calculsied  and  compared  with  a threshold 
determined  from  siatistieal  tablaa.  This  decision  procedure  is 
laforerKSd  in  mathematical  litaratura  as  simultaneous 
statistical  infarenca  and  is  daaeribad  mors  fully  m (12|.  Tha  V- 
UV-S  decisxm  is  a single-sidsd  hypothesis  test  using  the  upper 
tail  of  tha  distributHtn  fuiKtion. 

The  detector  compares  a sat  of  m time  data  samples  from 
one  of  the  subbartds  with  prwlad  data  from  tha  other  subbanrts 
to  determma  if  the  sample  amplitude  distributxins  (AD)  are  tha 
same  or  diffarant  baaed  on  ranks.  Tha  form  and  paramaiars  of 
tha  ADs  are  unknown.  .he  sample  amplitude  distribuMns  in 
the  subbands  are  statistically  similar,  to  within  tasting  error, 
noias  only  is  daclarad  at  the  output  of  the  detector.  If  the 
sample  AO  m a aubband  with  a frequency  range  below  2000 


which  haavi.y  weighs  the  ranks  near  tha  upper  tail  in  tha 
critical  daciaxtn  region.  Under  Hq  the  Savage  statistic  satisfies 


E(SI  I m , N * m -h  n 

Vpr,s,x.^  O-Tfl  T' 


(4) 


Considsr  tha  amplitude  distribution  function  F ( • ) with 
stanrlard  deviation  8 and  zero  mean  corresponding  to  Hq.  For 
tha  condition  8|  > 8q  wa  have  F(x/8|  )<  F(x/8o  I in 
tha  critical  tast  region  of  the  upper  tail  Let  8 | correspond  to 
the  sample  from  tha  subband  under  tast  and  let  8o  correspond 
to  tha  pooled  population  unrlar  Hq.  If  voiced  or  unvoiced 
speech  is  present,  then  3i>9q,  otherwise  8|  < 8q. 
Consequently  the  hypothesis  test  can  bo  stated  as 


Hq:  8)  SSq 

H,:  8,>8o 


(5) 


which  can  bo  tasted  with  ranks  based  on  the  Savage  statistic  S 
sitd  a threshold  S°  selectad  from  rank  statistics  The 
procedure  follows. 

Under  the  null  hypothesis  Hq.  select  a threshold  S 4 such 
that  PfS  i S" ) : t-q  where  a is  the  probability  of  a type  I 
error  usually  between  0 10  and  0.01 . The  quantiles  of  are 
given  in  Table  X of  Haiak  (14)  for  N S 20  If  N > 20  a normal 
distribution  approxirrtation  can  be  used  considering  Eq.  1 and  2. 
Accept  Hq  if  S < S"  Othorwisa  reject  Hq. 

Ranking  of  tha  data  and  calculation  of  S may  require  an 
excassivo  number  of  computer  manipulatkMis:  the  procedure 
requires  that  all  data  from  the  filters  be  stored  tor  each  data 
frame  tor  ranking  purposes  This  problem  can  be  reduced  using 
a mixad  Savage  statistical  test  (14]  which  was  applied  to  the 
data  presentad  in  the  following  section. 
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I EXPCRUMENTAL  RESULTS 

Th«  nonparamatric  clataifiar  was  taaMd  on  tha  diagnostic 
myina  last  (ORT)  fils  taps  supplisd  by  Oyna  Stat  Incorporstad 
(4|.  Tha  addilivs  vnhiia  noisa  taps  was  gansratsd  by  digitiiing 
tha  analog  ou^t  of  an  analog  noisa  gsnaraior  Both  tha  word 
fila  and  lha  no«s  IHa  wars  prafillarsd  with  a low-pasa  fillar 
having  a 3.2  kHi  cutoff  frsquaiKy  and  wars  samplad  at  6.067 


r 


\ 
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Using  lha  softwara  programs  daacribsd  in  (161  a'Nl  tha  ORT 
data  fila.  a controllsd  ORT  word  data  bass  with  additiva  whits 
noisa  of  prograssMSly  smallar  signal-to-noisa  ratioa:  39.  30, 
20.  10.  and  0 dB  wars  crsstsd  and  proesssad  by  tha  dsiactor 
algorithms.  Tasts  wars  conduciad  to  svsiuata  tha  spsach 
datactor's  parformanca  for  tha  fivs  dMfsrant  signal-to-noisa 
ratioa  uf  widaband  Gaussian  noisa.  For  aach  cisan  last  word 
from  tha  ORT  fila.  a manual  analysis  was  parformsd  on  aach 
15-millisscond  intarval  to  classify  it  as  voicsd.  unvoicad.  or 
silarKS  bassd  on  visual  nspaction  of  tha  acoustic  wsvaform 
and  a phonatic  misrprstation  of  tha  uttsrancs  Two 
indapandsnt  manual  classifications  wars  mads  on  aach  tsst 
word. 

A V-UV-S  dacison  was  mads  by  tha  computer  avary  15 
milliseconds  basad  on  a mwad  Sa"aga  statistic  using  100 
samples  from  aach  filter  subband  raprasantsd  in  figurs  2 Tha 
mixed  Savage  statistK  was  formed  by  averaging  tha  absohita 
value  of  5 samples  forming  20  averaged  samples  par  subband. 
Tha  80  averages  from  tha  tour  subbandv  ware  pooled  and 
ranked. 

Error  rates  ware  computed  by  comparing  tha  manual 
classificatKin  with  tha  datactor's  classMication  output.  Table  1 
summaraas  tha  overall  recognition  rata  as  a function  of  signal- 
to-noisa  for  the  simultaneous  decision  procedure  for  all  20  test 
uttarancas. 

Tha  recognition  results  in  Table  1 and  spectral  analysis  of 
(ha  ORT  background  noise  indicaia  that  a significani  low- 
Iraquancy  spectral  component  is  present  in  tha  bsckgrourj 
noisa  of  tha  ORT  file.  Table  1 and  tha  additional  last  dascribad 
in  (1)  show  that  as  noisa  is  added,  tha  affect  is  to  whiten  tha 
spectrum,  and  therefore,  tha  misclassification  rata  dacraasas  at 
30  dS  as  compared  to  39  dB 

CONCLUSIONS 

A thaoratical  and  experimental  Investigation  for  datscting 
the  prasanca  of  speech  in  wideband  noise  and  classifying  tha 
detected  utterance  as  voiced  or  unvoicad.  basad  on  a 
nonparamatric  statistical  detection  approach,  has  bean 
dascribad  Tha  speech  datactxin  tachnxiua  that  was  tasted  is 
affactiva  lOr  detecting  speech  in  widaband  noisa  at  a signal-lo- 
noiaa  ratio  from  39  to  0 dB  and  maata  tha  raquiramani  for 
being  independent  of  transmission  channel  characteristics, 
racording  conditions,  and  distraiutHin  of  tha  background  random 
noise.  Tha  desirable  feature  of  this  datactH>n  or  classification 
schema  is  that  is  requires  neither  a training  sat  of  data  nor  a 
prion  information  of  tha  statistical  paramatars  of  speech  or 
background  noisa 
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ABSTRACT 

Llneer  predictive  coding  (LPC)  hee  been  euc- 
ceeefully  epplled  to  the  encoding  of  epeech  end 
other  tlB*  eerie*.  It  he*  been  widely  observed, 
however,  thet  the  perforasnee  of  *n  LPC  slgorltha 
deterlorete*  repldly  In  the  preeence  of  beckground 
nole*.  In  thl*  peper,  we  describe  end  dlecuee  one 
epproech  to  the  Identlf Icetlon  of  * tlae  serle* 
corrupted  by  additive  white  noise. 

A coaaon  approach  to  this  problea  Is  to  pre- 
filter  the  noisy  tine  aeries,  and  then  to  apply  an 
estlaatlon  slgorltha  which  treats  the  tlae  series 
as  If  It  were  noise-free.  He  describe  an  alterna- 
tive approach  which  Involves  aodlfylng  the  tlae- 
serles  aodel  at  the  outset  to  account  for  the 
presence  of  noise.  An  estlaatlon  slgorltha  Is 
then  developed  for  this  aodlfled  aodel.  He  dis- 
cuss the  davelopaent  of  the  aodel,  the  estlaatlon 
slgorltha,  and  soae  representative  experlaental 
results . 

INTRODUCTION 

Linear  Predictive  Coding  (LPC)  has  been 
widely  and  successfully  applied  to  the  encoding 
and  processing  of  speech  waveforas  and  other  tlae 
series.  Most  of  the  Initial  deaonstratlons  of  LPC 
were  conducted  using  high-quality  and  relatively 
nolse-fraa  signals,  however.  It  hee  recently 
becoae  clear  that  background  noise  and  other  per- 
turbations can  cause  a serious  degradation  In  the 
perforasnee  of  LPC  algorlthas  (1,  2).  In  speech 
processing,  for  exaaple,  the  presence  of  noise  can 
adversely  affect  silence  detection,  volced/unvolced 
deteralnatlon,  pitch  period  calculation,  and  Iden- 
tification of  the  LP  coefficients.  The  work  dla- 
cuased  In  this  paper  deal*  only  with  the  problea  of 
coefficient  Identification,  and  Is  applicable  to 
any  tlae  series  which  can  be  aodelad  as  an  all-pole 
or  autoregressive  (AR)  process  perturbed  by  addi- 
tive white  noise.  He  aake  the  further  slapllfylng 
asauaptlon  that  the  order  of  the  process  Is  known; 
thus,  only  the  unknown  coefficients  of  the  differ- 
ence aquation  defining  the  AR  process  aust  be  estl- 
aated  froa  the  observed  data. 

Several  scheaes  have  been  developed  to  deal 
with  the  effects  of  noise  on  LPC  estlaatlon  algo- 
rlthas (1,  3,  4).  The  approach  to  noisy  tlae- 
eerlas  analyala  which  we  discus*  In  this  peper  In- 
volves a aodlflcetlon  of  the  process  aodel  at  the 


outset  to  account  for  the  effects  of  additive  white 
noise.  He  show  that  the  addition  of  white  noise  to 
an  AR(q)  process  (an  all-pole  process  with  q poles) 
results  In  a new  process  which  Is  an  autoregres- 
sive aovlng-average  (ARMA)  process  with  q pole* 
and  q seroes.  Furtheraore,  the  poles  of  the  new 
8RKA  (q,  q)  process  are  Identical  to  the  poles  of 
the  original  AR(q)  process,  a fact  which  greatly 
slapllfles  the  estlMtlon  process.  By  aodlfylng 
the  aodel  In  this  way,  we  transfora  the  problea 
of  estlaatlng  the  paraaeters  of  an  AR  process  In 
the  presence  of  noise  Into  a problea  of  estlaatlng 
the  paraMters  of  an  ARMA  process  which  has  the 
same  poles  or  AR  coefficients  as  the  original  pro- 
cess. 

Optlaal  estlaatlon  of  the  paraaeters  of  an 
ARMA  process  Is  snich  aore  difficult  than  estlaat- 
lng the  paraaeters  of  an  AR  proceas.  Our  objec- 
tive In  this  paper  Is  to  detecaln*  whether  there 
Is  any  perforsMnee  advantage  to  be  gained  using 
the  approach  deacrlbed  above,  and  we  do  not  con- 
cern ourselves  with  coaputetlonal  efficiency  per 
*e.  If  this  aethod  were  to  be  lapleaented.  It 
would  no  doubt  have  to  be  aodlfled  to  Increase  Its 
speed . 

Estlsutlon  of  the  paraMters  of  an  ARMA  pro- 
cess has  been  extensively  discussed  In  the  litera- 
ture (5,  6,  7).  He  describe  an  slgorltha  devel- 
oped by  Anderson  (7)  for  conditional  aaxlaua- 
llkellhood  estlawtlon  using  a version  of  the 
Newton-Raphson  swthod. 

Finally,  we  present  the  results  of  a nuaber 
of  experlaent*  conducted  using  slaulated  tlae- 
serles  data.  He  coapare  the  eatlaates  obtained 
using  the  Newton-Raphson  aethod  with  those  ob- 
tained by  applying  the  standard  autocorrelation 
aethod  of  LPC  estlawtlon  to  both  the  unaodlfled 
noisy  tlae  series  and  to  a Hlener-flltered  version 
of  this  noisy  tlae  series.  He  also  Include  re- 
sults obtained  by  solving  the  "shifted"  Tule- 
Halker  equation*  (8). 

THE  MODEL 

In  this  section,  we  give  a very  brief  devel- 
opaent  of  the  aodel  which  result*  when  whit*  noise 
is  added  to  an  AR  proceas.  For  aore  detail*,  see 
(9)  or  (10). 

We  asswe  that  the  desired  signal  process 
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1 


•(k)  la  a acatlonary  AR(q)  proceaa  of  kno%m  ordar 
q daacrlbad  by 

I a(l)a(k  - 1)  - e(k),  (1) 

1-0 

whora  e(k)  la  a aequence  of  Independent,  Identl- 
cally-dlatrlbuted  Gaussian  randon  variables  with 
■ean  aero  and  variance  o|.  Ue  further  assuae  that 
a(0)  • 1 and  that  q > 0.  This  signal  process  la 
contaalnated  by  a sequence  n(k)  of  Independent, 
Identically-distributed  Gaussian  randoa  variables 
with  mean  aero  and  variance  to  fora  the  observ- 
able sequence 

x(k)  - e(k)  + n(k).  (2) 

It  can  be  shown  (8,  9)  that  x(k)  satisfies 
the  relationship 

9 2 

I a(l)x(k  - 1)  - I b(J)v(k-J),  (3) 

1-0  J-0 


x(k)  - v(k) 
aatrlx  fora 

where 

and 


0 for  k < 0,  ife  can  write  (3)  In 

A X - B v (5) 

9 . 

A - J a(l)  (6) 

1-0 

i-  I b(j)  (7) 

J-0 


The  conditional  log  likelihood  function  (condi- 
tioned upon  the  inital  values  assuaed  for  x(k)) 
can  now  be  %n:ltten 

ln(f)  - - I tn(2»)  - I fn(o^)  + In  |a|  - In  |b| 
1 


where  b(0)  - 1 and  v(k)  la  a sequence  of  Indepen- 
dent, Identically-distributed  Gaussian  randoa  vari- 
ables with  aean  aero  and  variance  o^.  Thus,  the 
observed  noisy  process  x(k)  can  be  viewed  as  an 
ABMA  (q,  q)  process  with  AR  coefficients 
{a(l))|,j^,  MA  coefficients  {b(j))5.i,  and  driving- 
sequence  variance  Oy.  This  new  aodel  contains 
2q  -f  1 paraaeters  coaipared  with  q -t  2 for  the 
original  aodel. 

Upon  coaparlng  (1)  and  (3) , we  see  that  the 
AR  coefficients  of  the  new  ARMA  aodel  are  Identi- 
cal to  those  of  the  desired  signal  process  s(k). 
Hence,  after  est lasting  the  pereaeters  of  the  ARMA 
process,  we  can  slaply  discard  the  MA  cstlaatea 
and  retain  the  AR  estlaates.  This  result  tests 
on  the  assuaption  that  the  additive  noise  is  white. 
If  It  is  not,  a siallar  result  can  be  estebllshed 
but  the  AR  paraaeters  will  no  longer  be  the  saae. 

PARAMETER  ESTIMATION 


We  showed  In  the  previous  section  that  estl- 
aatlon  of  the  pareaeters  of  an  AR  procass  contaal- 
nstad  by  additive  idilte  noise  can  be  accoi^llshed 
by  estlaetlng  the  parsawters  of  an  associated  ARMA 
process.  We  have  edopted  and  laplaacnted  an  ARMA 
estlaatlon  algorltba  of  Anderson  (7)  for  this  pur- 
pose, and  we  briefly  describe  this  algorltha  In 
this  section. 


Of  the  several  aethods  described  In  (7) , the 
one  as  aelacted  Is  the  so-called  tlae-doasln 
Nawtoo-Raphson  asthod.  To  begin,  we  define  the 
N X N aatrlx 


L^-k 

where  1||_||  f*  (N  - k)  « (N  - k)  Identity  aa- 
trlx. Further,  define  coluan  vectors  x - (x(0)... 
x(R  - 1)1^  sad  V - [v(0)...v(N  - 1)^.  Then 
• (0...0  s(0)essX^  - 1 - k)J*. 

Using  the  an trices  L^,  sod  asaualns  that 


The  conditional  aaxlmua  likelihood  estlaates 
for  {a(l))?_i,  (b(j))1,j,  and  Oy  can  be  obtained 
In  principle  by  differentiating  (8)  with  respect 
to  each  of  these  paraaeters,  equating  the  results 
to  zero,  and  solving  the  resulting  set  of  slaul- 
taneous  equations.  Unfortunately,  these  equations 
are  nonlinear  in  the  paraaaters  and  cannot  be 
solved  directly.  Thus,  we  aust  resort  to  Iterative 
aethods  of  solution. 

The  estlaatlon  of  Oy  can  be  decoupled  froa 
the  estlaatlon  of  the  a(l5  and  b(J).  Specifically, 
If  we  use  (5)  to  define 

1 ■ A x,  (9) 

then 

®y  “ M V.  (10) 

To  estlMte  the  a(i)  and  b(J),  tte  define 
a - ta(l)  ...  a(q)]T,  b - (b(l)  ...  b(q)]T,  and 
e_  - [a^  t b^]^.  The  estlaate  of  ^ is  obtained 
Iteratively  using  the  equation 

*4+1  “ til) 

where  Is  the  gradient  vector  and  ^ Is  a aatrlx 
whose  elesMnts  will  be  given  below.  To  use  (II), 
an  Initial  value  6^  la  chosen,  ^ and  are  cal- 
culated, and  these  values  are  then  used  to  obtain 
an  updated  estlaate  6^.  The  process  is  then 
repeated  Iteratively  until  soae  stopping  criterion 
Is  satisfied. 

It  is  convsnlant  to  express  R and  g In  the 
partitioned  foras 
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TIm  ^Mctors  w and  u ara  q > 1 eoluan  vectors  whose 
Jth  aad  Mh  aloaea?s  are,  raspactlvely. 


M. 


0 

V 

(14) 

(15) 

and 


) . Tha  eleawnts  of  £,  Q , and  £ are 

given  by 

(16) 

(17) 

(18) 

we  used  the  true  psrsaeter  vslues  ss  the  Inltlsl 
vslues.  We  did  this  on  purpose  In  order  to  svold 
extensive  discussion  of  this  Issue.  The  prlaary 
affect  of  using  other  reasonsblc  Initial  guesses 
should  be  a aodast  Incrcaae  In  the  rate  of  failure 
to  converge,  and  does  not  seriously  affect  our 
conclusions.  This  Is  borne  out  by  soae  results 
obtained  when  we  did  not  know  the  trtie  values  for 
the  ■ovlng-avcrage  parameters  and  therefore  were 
forced  to  use  other  starting  values. 

Estimates  of  the  AR  parameter  a were  obtained 
for  518  frames,  each  containing  256  points  of  data, 
for  each  of  six  slgnal-to-nolse  ratios.  The  meth- 
ods used  to  obtain  these  estimates  were  the  follow- 
ing: 

1.  The  tlM-domaln  Newton-Rsphaon  method 
described  above. 

2.  The  standard  autocorrelation  method  of 
LPC. 

3.  Solution  of  the  shifted  Yule-Walker  equa- 
tions to  account  for  the  moving-average 
portion  of  the  process. 


To  obtain  R^  and  wn  simply  substitute  the 
parameters  from  the  1th  estimate  6^  Into  (12)-(18). 
Tor  a detailed  discussion  of  the  computatlona 
Involved,  see  [8]. 

EXPERIMENTAL  RESULTS 

We  have  conducted  extensive  tests  of  the 
time-domain  Nawton-Raphson  algorithm  described  In 
Che  prevloua  aectlon,  and  have  compared  Its  per- 
formance to  Chat  of  eeveral  other  estimators.  In 
this  section,  we  briefly  summarise  some  represen- 
tative results  and  discuss  tha  conclusions  we  have 
drawn  from  these  results.  A discussion  of  results 
obtained  using  an  algorithm  developed  by  Stelgllts 
{11]  appears  elsewhere  In  these  proceedings  [12], 
aad  a much  mere  extensive  discussion  of  all  our 
results  appears  In  [8]. 

Although  we  have  conducted  tests  on  higher- 
order  proceaees,  we  restrict  our  attention  here  to 
a flrat-erder  AR  process  contaminated  by  additive 
white  nolae.  For  deflnlteneas,  we  took  the  single 
AR  parameter  to  be  0.5.  As  an  Initial  test  of  the 
performance  of  the  Nawton-Raphson  algorithm  we 
per formed,  for  a number  of  256-polnt  frames  of 
data,  a straightforward  search  In  (a,  b)  parameter 
space  to  locate  that  point  which  minimised  the 
unconditional  sum  of  squarad  residuals  (see  (10), 
Chapter  7).  The  Newton-Raphson  algorithm  was 
appllad  to  the  same  data,  and  Its  estimates  of  a 
and  b were  compared  with  the  values  obtained  using 
tha  search  procedure.  In  all  cases  In  which  the 
Newton-Raphaon  procedure  converged,  the  results 
agreed  very  closely.  These  results  confirm  chat 
whan  poor  aatlmates  are  obtained  using  the  NR 
mathod.  It  la  almost  always  the  case  that  thesa 
poor  estimates  rsally  do  mlnlmlss  the  sisi  of  the 
squared  residuals.  Thus,  tha  weaknass  la  not  In 
tha  Mt  algorlttm  per  ae,  but  Is  Inherent  In  the 
underlying  lesat-aquarss  approach  to  sstlmatlon. 

Is  most  smparlmsnts  using  the  NR  algorltlm. 


4.  Wiener  filtering,  assuming  knowledge  of 
the  slgnsl  and  noise  spectra,  followed 
by  LPC  estimation.  In  practice,  these 
spectra  are  not  known,  and  In  fact  are 
to  be  estimated,  but  this  approach  pro- 
vides an  Indication  of  what  can  be 
achieved. 

The  results  obtained  using  these  four  meth- 
ods were  averaged  over  the  518  frames  of  data,  and 
these  average  estlMtes  are  plotted  In  Fig.  1.  In 
the  case  of  the  NR  method,  the  average  was  taken 
only  over  Chose  frames  for  which  convergence 
occurred  (515  at  0 dB,  214  at  -10  dB,  518  In  all 
ocher  cases).  In  terms  of  these  average  results. 

It  Is  clear  that  the  NR  and  shifted  Yule-Walker 
methods  are  superior  to  the  other  methods.  In 
particular,  the  SNR  threshold  below  which  the 
estimate  becomes  very  poor  Is  roughly  14  dB  lower 
for  the  NR  method  than  for  the  uncorrected  LPC 
method. 

Looking  only  at  the  averages  can  be  somewhat 
misleading,  however.  A more  complete  picture  Is 
obtained  by  looking  at  the  variances  of  the 
estimates,  and  here  some  of  the  advantage  of  the 
NR  algorithm  Is  lost.  The  variance  of  the  NR  estl- 
sMte  Is  appreciably  larger  than  that  of  tha  LPC 
estimate,  as  Is  shown  In  Fig,  2.  Thus,  although 
the  NR  method  Is  superior  on  the  average,  the  LPC 
estimate  will  actually  be  better  for  a significant 
numbar  of  Individual  frames. 

SUMMARY 

In  this  paper,  we  have  shown  that  an  AR  (all- 
pole) process  contaminated  by  additive  white  noise 
can  be  modalad  as  an  ARMA  (pole-saro)  process  whose 
poles  are  Identical  to  thosa  of  the  original  AR 
process.  Thus,  the  problem  of  estimating  the 
parametera  of  a nolay  AR  proceaa  can  be  tranafomed 
into  one  of  estimating  the  parametera  of  an  ARNA 
process,  unfortunately  a much  harder  problem. 


42 


I 


» 


i 


! 

1 

I 


I 


H«  laplaaantad  an  ABIU  aatlaatlon  algorltha 
oaing  tha  Nawton-Eaphaon  approach,  and  Chen  appliad 
thla  algorltha  to  a larga  aaount  of  data  fron  a 
aynthaclc  noiay  AE  (1)  procaaa.  Thla  procadura 
ylaldad  conaldarably  battac  raaulta  on  the  average 
than  did  an  unaodlfled  IfC  algoritha,  but  thla 
advantaga  la  quallflad  by  tha  fact  t^t  tha  varl- 
anca  of  tha  ME  aatlaata  la  graatar  than  that  of 
tha  LPC  aatlaata. 
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Fig.  1.  Coaparlson  of  four  aatiaators 
of  a(l).  True  value  > 0.5. 


Fig.  2.  ME  and  LPC  aatiaates  of 
a(l)  with  ia  Halts. 
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ABSTRACT 

Stelglltz  (2)  has  recently  described  an 
algorithm  for  estimating  the  paraMters  of  an 
autoregresslve-BK>vlng-average  (ARMA)  process. 

This  algorithm  has  application,  for  example,  to  the 
problem  of  determining  the  poles  and  zeroes  of  the 
vocal-tract  transfer  function. 

In  this  paper,  we  report  and  discuss  the 
results  of  a number  of  simulations  conducted  using 
the  Stelglltz  algorithm.  The  bulk  of  the  experi- 
ments Involved  driving  the  ten-pole,  two-zero 
filter  described  In  (2)  with  a single  pulse,  with 
a short  pulse  train,  and  with  samples  of  white 
Gaussian  noise.  In  each  of  these  cases,  we  eval- 
uated the  effects  of  such  processing  options  as 
windowing,  preemphasls,  and  cepstral-domain  filter- 
ing. He  also  discuss  and  compare  results  obtained 
by  applying  the  Stelglltz  algorithm  and  a Newton- 
Raphson  conditional  aaxlzwm-likellhood  algorithm 
to  a first-order  process. 


INTRODUCTION 


Stelglltz  and  McBride  (1)  propose  a system 
Identification  procedure  in  which  the  s-domaln 
transfer  function  of  the  unknown  system  is  B(z)/ 
A(s).  B(z)  and  A(z)  are  polynomials  given  by 


A(z)  - 1 a(l)z”^, 
1-0 

a(0)  - 1.0, 

(1) 

P 

B(s)  - 1 b(l)i"\ 

b(0)  - 1.0. 

(2) 

1-0 

The  polynomial  A(t)  determines  the  pole  locations 
of  the  model  and  la,  equivalently,  the  autofegres- 
slvo  (At)  operator.  The  zero  locations  are  deter- 
mined by  B(a),  the  moving-average  (MA)  operator. 
Thus,  If  the  driving  sequence  v(k)  la  a idiite 
noise  sequence,  the  response  x(k)  la  an  ARMA  pro- 
eaae.  Assuming  that  the  Input  V(t)  and  output 
X(a)  are  known,  the  model's  reeponea  Is  D(a)  - 
(B(t)/A(t)]V(s).  The  error  Is  then  given  by 
B(s)  - DCs)  - X(t).  After  linearising  the  modal, 
Stelgllts  and  NcBrlda  arrive  at  an  iterative  pro- 
cedure which  aetlmetea  the  eeefflclanta  a(l) 


a(q),  b(0),  ....  b(p). 

In  (2) , Stelglltz  applies  this  method  to  data 
obtained  from  a system  In  which  the  Input  V(z)  Is 
unknown.  The  data  x(k)  are  assumed  to  reslut  from 
driving  the  unknom  system  with  an  Impulse. 
Stelglltz  applies  this  method  to  a ten-pole,  two- 
zero  "unknown"  system  In  which  the  Input  v(k)  is 
actually  an  Impulse  train,  simulating  voiced 
speech.  Because  the  data  x(k)  are  assumed  to 
result  from  an  Impulsive  Input,  Stelglltz  proposes 
that  x(k)  be  modified  prior  to  analysis  to  lizprove 
that  assumption.  Preemphasls,  windowing,  and 
cepstral-domain  operations  are  suggested  toward 
that  end. 


EXPERIMENTAL  RESULTS 


The  application  of  this  algorithm  to  data 
generated  using  white  noise  as  the  input  to  the 
ten-pole,  two-zero  model  used  by  Stelglltz  Is 
reported  in  (3).  This  represents  the  situation 
usually  encountered  In  ARMA  model  estimation.  For 
comparison,  the  algorithm  Is  also  applied  to  data 
generated  with  Inputs  of  a single  Impulse  and  an 
izqtulse  train.  The  various  modifications  to  x(k) 
proposed  by  Stelgllts  are  performed.  The  result- 
ing sequence  Is  analyzed  to  obtain  estimates  of  the 
ARMA  coefficients.  Results  are  reported  here  on 
those  modifications  which  produced  estimates  of 
the  ARMA  coefficients  having  the  smallest  mean 
square  error  when  compared  to  the  coefficients 
used  to  generate  the  data. 

Figure  1 shows  the  spectrum  (In  dB)  of  the 
system  to  be  Identified.  Using  an  Input  of  a 
single  Impulse,  the  resulting  data  sequence  x(k) 
la  shown  In  Fig.  2.  The  best  estimates  of  the 
ARMA  coefficients  sre  obtained  by  applying  the 
algorithm  directly  to  x(k).  The  spectrum  of  the 
estimated  model  after  one  Iteration  is  shown  In 
Fig.  3.  There  la  essentially  no  error  in  the 
estimate. 

The  next  cm  IS  to  be  discussed  Is  the  analy- 
sis of  data  ganenited  using  am  Impulse  train  me 
the  input.  The  t ipwleee  occur  every  100  sample 
points.  Ths  raamltlmg  output  Is  shown  la  Fig.  4. 

In  this  emsa,  ths  followlag  modifications  are  made 
to  the  data  x(k)i 
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1.  HimiIih  window  x(k). 

2.  Window  tho  coaplax  copotrua  of  x(k). 

3.  Trnnafora  tho  rooulting  cepstral 
ao^uonco  to  obtain  tha  tiaa  doaain 
aaquanca  x^(k). 

Aftar  windowing  x(k)  in  atop  1 to  obtain 
w(k)  • x(k),  tho  data  ara  tranaforaad  uaing  an 
N-point  OFT.  Prior  to  eoaputing  tha  coaplax 
capatrwa,  tha  aignal  ia  forcod  to  havo  laro-phaaa. 
Tha  coaplax  logaritha  thua  bacoaaa 

ladog  X(t))  - I logjtRe  X(t)l^  + 1b[X(1) l^j(3) 

ladog  X(t))  - 0,  (4) 


whara  X(i)  la  tha  DPT  of  w(k)  • x(k).  Aftar  the 
windowing  oparatlona  to  be  porforaad  on  the  cep- 
atrua,  tha  aaro-phaaa  aaauaption  ylalds  the  aaae 
raaulta  aa  thoaa  that  would  be  obtained  using  the 
actual  phase  of  X(l) . The  iaposltlon  of  sero 
phase  avoids  the  necessity  of  phase  unwrapping. 
The  coaplax  capatrua  x(k)  of  this  xaro-phase  sig- 
nal is  found  by  parfomlng  another  DPT  on  tho 
coaplax  logaritha. 


TWO  operations  are  now  perfomad  on  x(k) . 
First,  let  i^(k)  ■ u(k)  * x(k),  where 


u(k) 


1,  k - 0 

<2,  k - 1 j 

0,  k - 1+  1 h 


Tha  cepstna  is  now  causal,  and  tha  corresponding 
tiaa-doaaln  signal  has  a aagnltuda  spactrua  iden- 
tical to  that  of  w(k)  • x(k).  The  second  opera- 
tion parforaad  on  x(k)  is  to  aero  the  portion  of 
the  cepstna  having  the  pitch  spike  associated 
with  tha  periodic  nature  of  x(k).  The  cepstral 
signal  x^(k)  resulting  froa  these  two  procedures 
is  transforsMd  back  to  tha  tlae-doaaln  signal 
x_(k).  Tha  cepstral  processing  has  achiavad  two 
gMla: 


1.  X|^(k)  la  a alniaua  phase  sequence. 

2.  Tha  periodic  nature  of  x(k)  la  sup- 

preasad . 

Tha  assiaptlon  of  an  iapulalve  input  is  aore  nearly 
valid  for  x^(k)  than  for  x(k) . Analysis  to  dater- 
alna  tha  Aim  coefficients  is  parfora^  on  x^(k), 
which  is  shown  in  Pig.  3. 

It  was  found  that  tha  Haalng  window  atap 
was  nacasaary  to  obtain  a convergent,  stable  fil- 
ter astlaata.  Praanphasis  was  not  parforaad  on 
either  x(k)  or  **  degrade 

tha  eoafflelent  Mtiaatea  aoaewhat.  Tha  spactrua 
of  tha  astlaata  aftar  two  Itarationa  is  ahown  in 
Pig.  t and  ia  quite  good,  confiraing  tha  results 
in  (2). 


Tha  last  case  to  be  conslderad  is  for  data 
generated  whan  the  input  to  the  systaa  is  an 
approxlaately  white  noise  sequence.  The  resulting 
output  is  shown  in  Fig.  7.  Tha  bast  astiaatas  in 
this  case  were  obtained  by  analysing  x(k)  directly. 
Unlike  the  lapulsa  excltad  casa,  however,  the 
astlaata  is  poor.  Figure  8 shoira  the  spactrw  of 
the  astlaata  aftar  two  iterations.  Further  Itera- 
tions result  in  progressively  narrower  and  higher 
spectral  peaks.  The  astlaata  often  bacoaas  un- 
stable. Tha  algorltha  is  no  longer  achieving  the 
excellent  results  found  in  tha  other  two  cases. 

Because  the  tenth  order  AR  operator  is  likely 
to  tax  any  estlaation  algorltha,  the  algorltha 
developed  by  Steiglits  was  applied  to  a single-pole 
systea  (q  ■ 1,  p > 0),  excited  by  white  noise.  The 
single  denoalnator  coefficient,  a(l),  was  O.S  for 
this  test.  The  estlaate  for  a(l)  froa  the  Stelg- 
lltt  algorltha  is  coapared  to  that  obtained  froa  a 
Newton-Raphson  (NR)  lapleaentation  of  a aarlaua 
likelihood  ARMA  estlaation  procedure  (4) . The 
results  for  ten  iterations  of  one  fraae  of  data  are 
presented  in  Table  1.  The  initial  guess  for  a(l) 
in  both  algorlthas  is  O.S,  the  ectual  value  of  the 
coefficient  used  to  generate  the  date.  Using  this 
as  the  initial  guess  resxives  the  uncertainty  about 
an  initial  guess  froa  the  test.  Froa  Table  1,  we 
see  that  after  the  first  Iteration,  the  NR  estlaate 
does  not  change  in  at  least  the  five  aoet  slgnlfi- 
esnt  figures.  The  Steiglits  estlaete,  however, 
varies  conslderebly  froa  Iterstion  to  iteration. 

The  estlaate  at  Iterations  2,  3,  and  4 is  unstable. 
Convergence  does  occur  in  later  itarationa,  but  to 
a value  indicating  the  pole  is  close  to  the  unit 
circle.  This  results  in  a narrow,  high  peak  in 
the  spectrua,  characteristic  of  the  estlaate  in 
the  tenth  order  case.  In  addition,  coaputations 
using  the  Steiglits  algorltha  requlrd  the  uae  of 
double  precision  arithaetlc,  even  in  the  previous 
successful  cases.  The  NR  aethod  does  not  require 
double  precision  arithaetlc  for  successful  paraa- 
eter  estlaation. 


CONCLUSION 


The  results  of  the  tests  perforaed  hers  con- 
flra  that  the  paraaeter  estlaation  algorltha  pro- 
posed by  Steiglits  (2)  produces  good  results  for 
the  lapulse-  and  iapulse-traln-exclted  cases.  Care 
Bust  be  taken,  however,  in  choosing  aodiflcationa 
to  x(k) . The  perforaance  of  the  algorltha  in  tha 
noiae-exclted  case  is  poor,  even  for  a flrat-order 
process.  The  algorlthai  does  not  appear  to  be 
applicable  to  the  problea  of  estlaatlag  the  paraa- 
eters  of  a noise-excited  ARMA  process. 
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Fig.  1.  10-pola,  2 aero  apectrua  to  be  Identified. 


Fig.  3.  Eatlaate  of  aodel  apectrua 
froa  lapulae-cxclted  output. 


Fig.  S.  Modified  ayatea  output 

after  cepatral  proceaalng. 


Fig.  2.  Syatea  output  when  excited  by  lapulse. 


-a  4«ac»a 


Fig.  4.  Syetea  output  when  excited 
by  lapulae  train. 


Fig.  6.  Eatlaate  of  aodel  apectrw  froa  aodl- 
fied  lapulae-traln-exclted  output. 
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Fig.  8.  Estlaate  of  aodel  spectrua 
Fig.  7.  Nol*«-«xclted  output.  froa  nols«-«xcltod  output. 


Table  1.  Coaparlson  of  the  Stelglltz  and  NR 

estlaatea  for  a(l)  of  an  AR(1)  procesa. 
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.89261 

.97296 

.99496 

.99732 

.99728 
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ABSTRACT 

Modification  of  the  rata  of 
occurrence  of  acoustic  events  without 
alterinq  frequency  content,  and 
modification  of  pitch  without  changinq 
time  scale  are  presented  as  equivalent 
problems.  While  the  short-time  Fourier 
transform  has  been  used  to  solve  the  rate 
modification  problem,  it  is  not  a natural 
tool.  It  lacks  the  scaling  property  of 
the  Fourier  transform.  The  constant-Q» 
transform,  on  the  other  hand,  exhibits 
this  property.  A more  natural  rate/pitch 
modification  system  using  the  constant-0 
transform  is  presented  which  performs  well 
with  rate/pitch  changes  by  factors  of 
between  one-third  and  three. 


INTRODUCTION 

The  possibility  of  modifying  the  rate 
at  which  speech  is  articulated  has 
prompted  a variety  of  efforts,  ranging 
from  simple  time-base  scaling  and  the 
excision  or  insertion  of  waveform  segments 
to  the  more  successful  and  complete 
approach  used  recently  by  Portnoff  [1] 
involving  time  scaling  of  the  short-time 
spectrum.  These  efforts  have  been 
hindered,  in  part,  by  the  difficulty  in 
satisfactorily  defining  what  is  meant  by 
rate-changed  speech.  As  Portnoff  points 
out,  this  difficulty  results  from  the 
ambiguity  in  classifying  aural  events  as 
having  either  temporal  or  harmonic 
significance.  Clearly,  aspects  of  a 
signal  which  appear  to  manifest  themselves 
harmonically  should  be  preserved  in  rate 
compression  or  expansion  of  a signal, 
while  characteristics  which  are  perceived 
in  time  ought  to  be  subject  to 
modification.  The  short-time  Fourier 
transform,  which  maps  signals  into  a 
t%K>-dimenslonal  space,  separates  the  time 
and  frequency  content  of  a signal  based  on 
the  assumption  that  variations  which  occur 
over  intervals  greater  than  the  constant 
time  resolution  of  the  analysis  window 
will  be  classified  as  temporal  events, 
while  other  characteristics  must  be 
measured  along  the  frequency  axis.  Hence, 
while  constant  bandwidth  analysis  has 
produced  very  good  speech  compression  and 


expansion  results,  it  requires  the  making 
of  signal-dependent  assumptions  about  the 
time-frequency  boundary.  Evidence 
suggests  that  a constant  bandwidth 
analysis  criterion  is  consistent  with 
neither  the  human  auditory  system  in 
general  nor  with  correct  formulation  of 
the  rate  compression/expansion  problem  in 
particular.  For  example,  recent  automatic 
phoneme  recognition  work  by  Searle  [2] 
suggests  that  information  by  which  various 
burst  and  stop  phonemes  are  recognized 
occurs  with  time  resolutions  finer  than  20 
msec,  and  probably  as  fine  as  5 to  10 
msec.  The  auditory  system,  on  the  other 
hand,  hears  tones  with  fundamentals  longer 
than  20  msec.  Of  course,  the  reason  that 
the  ear  perceives  stops  and  bursts  as 
temporal  events,  while  correctly  analyzing 
50  hz  tones  is  that  it  is  not  a constant 
bandwidth  device,  but  rather,  constant-Q. 
Thus,  the  constant-Q  transform,  which  maps 
signals  into  a two-dimensional  space  where 
time  and  frequency  resolutions  are 
dependent  on  analysis  frequency,  provides 
a more  natural  tool  for  performing 
independent  modifications  to  temporal  or 
harmonic  aspects  of  signals.  The  problem, 
then,  of  defining  what  portions  of  a 
signal  ought  to  be  compressed  or  expanded 
in  a speech  rate  change  system  is  at  least 
partially  solved  by  requiring  the 
time-frequency  boundary  to  be  a variable 
related  to  the  ear's  frequency-dependent 
boundary. 


CONSTANT-0  ANALYSIS  AND  SYNTHESIS 

The  continuous  formulation  of  a 
Fourier-like  transformation  which  has  a 
frequency-varying  time- frequency  boundary 
is  given  as 

m 

F(ta,,t)  - I f(T)h((t-T)u))e*^"^  dr  (1) 

mm 

where  the  analysis  window  function,  h,  is 
defined  to  have  finite  non-zero  extent  (as 
in  the  Hann,  Hamming,  Bartlet  and  Blackman 
windows,  for  example).  If  by  the 
single-argument  functions,  F and  H,  the 
Fourier  Integral  transforms  of  f and  h are 
designated,  this  constant-Q  transform  may 
be  written  in  the  form. 
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F(«,v-«)  - r(ta,)H(<v-«>)/uJ/l<i|  (2) 

wh«r«  V is  ths  frsqusncy  vsriabl*  of  the 
Fouiisr  integral.  Clearlyt  the  analysis 
behaves  for  any  analysis  frequency,  u , as 
3 bandpass  filter  whose  frequency 
resolution  is  directly  proportional  to  its 
center  frequency,  u , whose  time 
resolution  is  Inversely  proportional  to 
u , and  whose  output  is  frequency-shifted 
to  sero.  Because  the  ratio  of  analysis 
frequency  to  frequency  resolution  is  a 
constant,  this  integral  transform  has  been 
referred  to  as  a constant-Q  transform.  By 
appropriately  choosing  the  width  and  the 
shape  of  the  analysis  window,  h,  the  time 
and  frequency  resolution  of  the  constant-Q 
transform  can  be  made  to  vary  in  a way 
which  closely  resembles  the  analysis 
performed  by  the  human  ear. 

Although  the  time-frequency 

separation  of  data  in  the  constant-Q 
spectral  domain  can  be  made  to  simulate 
that  in  the  ear-domain,  (1)  is  not  useful 
in  performing  Independent  frequency  or 
time  scaling  without  a corresponding 
synthesis  expression.  The  filterbank 
analogy  to  constant-Q  analysis,  explained 
above,  suggests  a method  of  synthesis.  If 
the  spectral  domain  is  thought  of  as 
equivalent  to  the  output  of  a contiguous 
bank  of  shifted,  scaled  lowpass  filters, 
the  possibility  that  a simple 
recombination  of  the  various  bandpass 
signals  will  lead  back  to  the  original 
signal  seems  obvious.  The  correct 
synthesis  expression  is,  in  fact,  such  an 
algorithm. 

f(t)  - k 7 F(u,t)e'^“^  du  (3) 

•OD 


resolution,  T«,  as  the  non-zero  extent  of 
h,  and  the  frequency  resolution,  F_,  as 
the  extent  of  the  principal  Interval 
around  sero  where  H is  positive.  The 
scaling  property  of  the  Fourier  Integral 
transform  guarantees  that  the  product  of 
the  time  and  frequency  resolutions  will  be 
a constant.  Thus, 

B.  “ (4) 

where  B.  is  a constant  whose  value  is  a 
consequence  of  the  choice  of  the  window 
function,  h,  and  of  the  definitions  of  F. 
and  T^.  Combined  with  the  Nyquist 
theorem,  this  Information  is  sufficient  to 
permit  sampling  of  the  constant  bandwidth 
spectral  domain  without  loss  of 
information.  In  particular,  the  density 
of  time  samples  must  be  greater  than  F , 
and  the  density  of  frequency  samples  must 
be  greater  than  T . 

The  extension  of  the  above  to  the 
problem  of  sampling  the  constant-Q 
spectral  domain  is  complicated  by  the 
dependence  of  T„  and  F^  on  frequency.  The 
solution  to  this  problem  is  enabled  by  the 
following  formalization.  First,  define 
to  be  the  more-limited  principal  extent 
over  which  H has  values  within  3 db  of  its 


maximum  value,  and  let 
constant  product  of  T^  and 

represent  the 

B > F T 

(5) 

Then 

Q - f/F, 

(«) 

From  this  the  time 

and 

frequency 

resolution  may  be  determined  as 


In  this  expression  k is  a constant  which 
is  determined  by  the  nature  of  the 
analysis  window,  and  which,  in  practice, 
is  most  easily  determined  empirically. 
That  this  analog  to  constant  bandwidth  FBS 
synthesis  is  but  one  of  a family  of 
possible  synthesis  forms  has  been  pointed 
out  by  Kaj iya  [3] . 


SAMPLING  THE  CONSTANT-Q  SPECTRAL  DOMAIN 

Implementation  of  (1)  and  (3)  on  a 
digital  computer  requires  the  proper 
sampling  of  the  spectral  domain.  Schemes 
by  which  a constant  bandwidth  spectral 
domain  may  be  sampled  without  loss  of 
information  are  limited  by  the  analysis 
window,  which  Imposes  time  and  frequency 
resolutions  on  the  spectral  information. 
Specifically,  these  time  and  frequency 
resolutions  may  be  defined  as  the 
respective  intervals  over  which  the  window 
function  and  its  Fourier  transform  are 
"significant*.  The  ambiguity  in  this 
definition  is  removed  by  defining  the  time 


and 

T.  (f) 

- B,  Q/f 

(7) 

r.  (f) 

• B,f/B,Q 

(8) 

Equations 

7 and 

8,  combined 

with  the 

Nyquist  theorem,  give  rise  to  lower  bounds 
on  the  instantaneous  sampling  densities 
along  the  frequency  and  time  axes 
respectively.  In  general,  if  At(f)  and 
Af(f)  are  the  instantaneous  sampling 
intervals  at  a frequency,  f,  then  , 

At(f)<  F"J(f)  <9) 


A f(f)  < T,(f)  (10) 

Utilizing  these  limits  and  the  filterbank 
formulation  of  the  constant-Q  transform 
suggested  in  (2) , constant-Q  analysis  can 
be  implemented  in  discrete  form  using  fast 
convolution  . 


i 


I 


I 
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TEMPORAL  AND  HARMONIC  SCALING 

Th«  approAch  to  rate  changes  taken  in 
the  work  reported  here  utilizes  a property 
of  the  constant-Q  transfor*  not  shared  by 
the  short-time  Fourier  transform.  This 
property,  the  time  scaling  property, 
follows  directly  from  (1)  by  substituting 
a time-scaled  function,  f(at),  for  f(t) 
and  renaming  the  result. 

Fj  (w,t)  -/  f(aT)h((t-T)ii»)e''l'"  dt  (11) 
Then,  with  a change  of  variables, 


Jur/a 


F,  (u,t)  - / f(T)h((at-T)a,/a)e‘ 
or 

F,  (ta),t)  - F(u/s,at)/  |a| 


(12) 
dT/|  a I 


(13) 


Thus,  the  constant-Q  time  scaling 
property,  given  the  relationship  of  (1), 

is 


Figure  1.  The  relationship  of  axis  ! 

scaling  in  the  time  and  spectral  domains  ' 

(a)  Original  spectral  event  (b) 
Harmonically-scaled  event  (c)  Temporally 
scaled  event.  ' 


COT 

' ^ f(at)  < > F(u/a,at)/ la|  (14) 

' This  property  can  be  used  to  relate  a 
change  of  scale  of  either  the  temporal  or 
the  harmonic  spectral  information  to  a 
change  of  scale  of  both  the  time  domain 
( signal  and  the  other  spectral  axis. 

I Assume,  for  example,  the  possibility  of 

scaling  the  temporal  axis  of  the 
i constant-Q  spectrum  by  a.  This  would  give 

I F,(u,t)  • F(fc,,at)  (15) 

[ If  the  signal,  ^ (t),  resulting  from 

substituting  Fjiu.t)  into  (3)  were  time 
scaled  by  1/a,  the  result,  using  the 
I constant-Q  time  scaling  property  would  be 


Because  of  the  above  duality,  and 
because  scaling  of  the  harmonic  axis  is 
conceptually  straight-forward,  this 
approach  was  utilized  in  the  work  reported 
here  to  enable  changes  in  either  domain. 
Schematically,  the  harmonic  scaler  is 
implemented  channel-by-channel  as  shown  in 
figure  2. 

other  ^ 

channel  [ 

outputs 

'^igure  2.  A single  channel  of  a harmonic 
scaler.  Channel  center  frequency  is  u.  , 
scale  factor  is  a.  ' 


CQT 

F,  (u,t)  - |a|  Fj  (au,t/a)< > f,  (t/a)  (16) 


F,  (u),t)  - I a |F(aa),t)  (17) 

Thus,  as  illustrated  in  figure  1,  a 
harmonically  scaled  constant-Q  spectral 
domain  is  related  to  a temporally  scaled 
constant-Q  spectral  domain  by  a change  of 
the  signal's  time  scale. 


Scaling  of  the  harmonic  axis  by  a factor, 
e,  in  a discrete  implementation  requires 
two  operations.  First,  the  effective 
bandwidth  of  each  channel  must  be  scaled 
so  that  the  relationships  among  the 
various  bands  are  not  altered  by  scaling 
their  center  frequencies.  Second,  the 
center  frequencies  of  the  various  channels 
must  be  scaled.  This  latter  operation  and 
the  synthesis  remodulation  may  be  combined 
into  an  equivalent  single  operation.  It 
should  be  noted,  however,  that  to  avoid 
aliasing  during  harmonic  expansion  (a>l), 
analysis  channels  must  be  adequately 
overaampled  (by  a factor  of  at  least  a). 
The  more  involved  of  the  two  operations 
above  is  the.  bandwidth  scaling.  Because 
the  effective  bandwidth  of  each  channel  is 
not  directly  proportional  to  the  phase 
derivative  as  often  assumed,  the  bandwidth 
of  the  channel  is  not  accurately  expanded 
by  a when  the  phase  derivative  is^ scaled 
by  a.  Kahn  and  Thomas  [4]  have  pointed 
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out  that  tha  bandwidth  of  a alnultaneously 
amplltuda  and  phase  nodulated  sinusoid  is, 
in  fact,  a function  of  both  modulating 
functions.  Represent  the  modulated  output 
sinusoid  of  the  channel  centered  at  iii|  as 


reduction  available  using  the  assumptions 
of  equations  (21)  and  (22). 

CONCLUSION 


c(t)  > m(  t)  cos[(iii  t-f  p(t)  1 (18) 

where  m(t)  and  p(t),  denoted  below  as  m 
and  p,  are  the  (real)  channel  amplitude 
and  phase  functions.  Then,  for 
non-deterministic  signals  the  channel 
bandwidth,  (2  , can  be  estimated  using  the 
Kahn  and  Thoinas  bandwidth  as 

o’  - (E{m’)+E{p*mM)/E{m’)  (19) 

where  the  E denotes  the  expectation  and 
the  dot  the  time  derivative.  Clearly,  if 
the  amplitude-modulating  function  is  a 
constant,  the  approximation  mentioned 
above  is  accurate.  If,  however,  the 
amplitude  portion  of  the  bandwidth  is  a 
significant,  but  not  dominant,  portion  of 
the  total  bandwidth,  simple  phase 
derivative  scaling  will  lead  to 
inaccurately  scaled  bands.  This  error  can 
produce  synthesized  signals  which  exhibit 
reverberant  effects  reminiscent  of  comb 
filtering.  Such  effects  are  particularly 
evident  in  harmonically  expanded  signals 
(i.e.  for  a>l).  To  determine  the  a 
corrected  factor  by  which  the  phase 
derivative  should  be  scaled,  assume  that  a 
correct  band width- expanded  signal,  c is 
given  by  * 


c,  (t) 

A 

- m^  (t)  cost  u^t+p^  (t)  1 

(20) 

and  that 

m^it)  - a^mit) 

(21) 

(t)  - p(t) 

(22) 

where  a^  and 

a^  must  be  real . Then 

a*n*  - 

'•a 

(E(m*  )>a,E(p*  m*))/E{m*  ) 

(23) 

Note  that  a, 
the  equation 

has  no  effect.  Substituting 

(19)  expression  forS).  into 

(23),  the  value  of  a can  be  determined. 

a,  - a(l  + (l-a-»)E{mM/E(pW  ll**  (24) 

This  equation  implies  a conditional 

relationship  between  a and  a,.  When 

expanding  bands  (a>l),  the  number  by  which 
the  phase  derivative  must  be  scaled  to 
scale  the  bandwidth  by  a is  greater  than 
a.  When  contracting  bands  (a<l),  the 
opposite  is  true.  It  should  be  noted  that 
when  the  _ amplitude  modulation 

contribution,  E(m>  )/E(in*  ) , dominates  the 
total  bandwidth,  (24)  may  become 
imaginary.  This  effect  corresponds  to  a 
lower  limit  on  the  total  bandwidth 


The  constant-Q  transform  has  been 
utilized  to  formulate  a perception-related 
definition  of  the  rate-change  problem. 
This  definition,  along  with  the  time 
scaling  property  of  the  constant-Q 
transform  suggests  a natural  way  of 
implementing  both  harmonic  and  temporal 
scale  changes.  The  author's  work  proves 
this  method  to  be  capable  of  good  quality 
compression  and  expansion  for  factors 
between  one-third  and  three. 
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